PREFACE


This symposium may well be, in the hindsight of
ten years from now, a marked turning point in Computer 
Architecture. With the dissolution of the Spring and 
Fall Joint Computer Conferences, one of the major forums 
for Computer Architecture has been lost. So we have 
begun an annual symposium on Computer Architecture, to 
be rotated from year to year throughout the world. The 
atmosphere of such a symposium should be more suitable 
for the professional interchange of ideas than is 
possible at a large conference. Indeed, from the 
quality of papers that have been submitted to this 
symposium, it is clear that the time is here for a top 
quality symposium. We are pleased to say that the 
papers in this symposium are those that at least two 
reviewers rated in the top category. We are sorry that, 
because of this, a large number of very good papers 
were rejected. However, we have passed these papers, 
together with their reviews, on to editors of journals 
that cover Computer Architecture, for their consider- 
ation. We feel that, to encourage the submission of 
good papers to a symposium, it is desirable for us to 
send those papers that don't happen to fit into a 
session, but are very good papers, to journals for 
further reviewing. 


The papers in the symposium indicate the growth of 
Computer Architecture as a science. Although it is 
difficult to explain the reasoning behind the decisions 
made in an architecture, in particular, the architecture 
of a practical machine, this reasoning is the basis of 
a science. It is too easy to simply show the final 
masterpiece, as an artist would do. This is the "Moses
Complex", as we call it, where the architecture of a
practical machine is presented as if it is burned into 
stone, and need not be questioned. Several papers in 
this conference are directed at the reasoning process 
itself. We intend to encourage other authors to focus 
on reasons for the architecture by having an open panel 
discussion at the end of each section. We hope that the 
attendees will emphasize questions on the reasoning 
behind the architecture, and the authors will prepare 
for such questions. If this becomes a tradition in 
this annual symposium, it should orient authors toward 
the scientific explanation of their architectures for 
later symposia. 


Parallel to this emphasis on explaining the 
reasoning, a number of papers in the symposium are on 
description languages. We believe that a widely used
description language will permit the compression of
detail so that all of the essential information is
present, but does not fill up a large part of the paper.
We believe that the development of a good description 
language is another cornerstone to the growth of 
Computer Architecture as a science. 


There is a wide interest, as exemplified in several 
papers, in the pedagogy of Computer Architecture. These 
papers show the need for courses which abstract the 
principles of Computer Architecture. There is also a 
trend to introduce more laboratory experience into 
Computer Architecture, to balance the thrust towards 
principles with a tie to the reality of hardware. 


A survey of the session titles shows some of the 
other exciting areas of current research. Some of the 
traditional areas, such as the design of fast arithmetic 
units, have been rather thoroughly researched, although 
some questions are yet unresolved. The current areas
that are receiving particular attention are the
connection of modular systems and fault tolerant or
fail-soft processing systems. As a special case of
modular systems, pipeline and cellular systems are
receiving continued attention. The growth of LSI,
and the advent of microcomputers in particular, is
evoking considerable excitement in modular systems of
all kinds. There are indications that modularity of
various kinds will provide some useful tools in making
computers fault tolerant or fail-soft. A while back,
someone wrote that in the next couple of decades,
Computer Architecture will not change the computers
that will be built, that they will differ from present
computers in that they are faster or have more primary
memory, and so on. I cannot agree! Driven by the
user's demands for fault tolerant computing and the
change in technology towards the use of microcomputers,
Computer Architecture will have considerable impact on
the machines that are going to be built over the next
decade.


This symposium owes a great deal to a number of
people, whom I wish to recognize. Mike Flynn deserves 
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initiated this symposium. We are no less appreciative 
MARKOV CHAIN MODELS FOR
ANALYZING MEMORY INTERFERENCE IN
MULTIPROCESSOR COMPUTER SYSTEMS¹


Dileep P. Bhandarkar²
Samuel H. Fuller 
Carnegie-Mellon University 
Pittsburgh, Pennsylvania 


ABSTRACT 


This paper discusses various analytical techniques
for studying the extent of memory interference in a
multiprocessor system with a crosspoint switch for pro-
cessor-memory communication. Processor behavior is
simplified to an ordered sequence of a memory request
followed by an interval of processing time. The system
is assumed to be bus bound; in other words, by the time
the processor-memory bus completes servicing a proces-
sor's request the processor is ready to initiate another
request and the memory module is ready to accept another
request. The techniques discussed include discrete and
continuous time Markov chain models as well as several
approximate analytic methods.


1. INTRODUCTION 


Carnegie-Mellon University is currently in the pro-
cess of constructing a multiprocessor computer system
(C.mmp) that will have up to 16 central processors
(Pc's)³ sharing the same physical address space (4), and
concern has been expressed about the performance of
such a system with this many active processors. In ad-
dition to the processors, there is a set of memory mod-
ules that are able to operate independently; little
would be gained if all the processors had to wait for
service from a single memory module. Between the pro-
cessors and the memory modules (Mp's) is an n by m
switch. There are a number of ways of implementing the
switch, but C.mmp employs a full n by m crosspoint
switch as shown in Figure 1.1. Other multiprocessors,
although limited to a smaller number of Pc's, also ba-
sically use a crosspoint switch, e.g. the Burroughs
D825 and the Univac 1110. For further discussion of
crosspoint switches, and a variety of other switching
structures, see Bell and Newell (3).


Mathematical models of computer systems can be 
developed at various levels of abstraction. A large 
number of models for time-sharing systems consider a 
job as a basic unit (cf. 10), and in many models of 
multiprogrammed computer systems the block of instruc-
tions between I/O operations is taken as a basic unit
(cf. 5). However, in this study a much more detailed 


¹ This work was supported by the Advanced Research Pro-
jects Agency of the Office of the Secretary of Defense
(F44620-73-C-0074) and is monitored by the Air Force
Office of Scientific Research.


² D. P. Bhandarkar is now with Texas Instruments Inc.,
Dallas, Texas.


³ We use the PMS notation of Bell and Newell (3) in this
report to describe hardware organization.


FIGURE 1.1 


mXn Crossbar Switch 


model is used to analyze interference as processors 
access individual words from the memory modules. Each 
processor's performance is measured by the number of 
memory accesses per unit time. The major contribution 
of this paper is a systematic method for a discrete 
Markov chain model. Other techniques described include 
Strecker's approximation (13), systems with exponenti- 
ally distributed memory service time, and a diffusion 
approximation. 


2. GENERAL MODELING ASSUMPTIONS 


Due to the complexity of the problem, the exact 
detailed behavior of memory interference in a multipro- 
cessor system is difficult to model. We make the fol- 
lowing assumptions with respect to the parameters that 
characterize the behavior of a Pc. 


Instruction mix: In general, processor behavior 
varies for different instructions. However, in this 
paper differences in instructions are ignored. Proces-
sor behavior is modeled as an ordered sequence of a
memory request followed by an interval of execution
time. At this level of abstraction no distinction is 
made between the processing needed to decode an instruc- 
tion and the processing corresponding to its execution. 
Thus, the processing time characterizing a Pc depicts 
only the aggregate behavior of the real Pc. Figure 2.1 
depicts the actual and abstracted behaviors. 


Processing time of Pc: The models discussed here
assume that the multiprocessor systems are bus bound,
i.e. the Pc is ready to initiate the next request and
the Mp module is ready to accept the next request at
the time the Pc-Mp bus recovers from the current ac-
cess. The analysis is also applicable to multiproces-
sor systems in which the effective processing time, tp,
is equal to the memory rewrite time, tw.


Access pattern of a Pc: This is the sequence of
memory locations accessed by the Pc. In this study
serial correlation between successive memory accesses
will be ignored. Demand patterns will be modeled as
sequences of Bernoulli trials. Memory accesses will be
characterized by the memory units to which they are ad-
dressed.


FIGURE 2.1


a. An Example of the Timing of a Typical Instruction

Legend:

1 instruction fetch        ta  memory access time
2 instruction decoding     tw  memory restore time
3 operand fetch            td  instruction decode time
4 instruction execution    te  processor execution time
  next instruction fetch


b. Simplified Processor Behavior. Two such
cycles model the instruction shown in
Figure 2.1a.

Mp access    Pc ready


Primary memory behavior: Memory performance is a
function of the fabrication technology, i.e. core or
semiconductor. It can be characterized by the access
time (ta), rewrite time (tw), and cycle time (tc).
Nominally, the cycle time is the sum of the other two.
In this study, no distinction is made between read and
write operations.


3. CONTINUOUS TIME MARKOV CHAIN MODEL 


Consider a multiprocessor system which consists of
n Pc's and m Mp's connected by a single crosspoint
switch. Let P_{ij} denote the probability that the i-th
processor requests service from the j-th memory unit.
A processor is queued if it is waiting for or in the
process of receiving memory service and it is active if
it is currently being serviced by a memory. Likewise,
a memory is said to be occupied or busy if there is at
least one processor queued for that memory unit.


In this first model, we apply the classic simplify- 
ing assumption in queueing theory: we model the service 
time, or cycle time, of the memory modules as exponen- 
tially distributed random variables. Clearly most 
memory systems do not have an exponentially distributed 
cycle time. However, techniques such as interleaving,
cache memories, and the type of memory access (read,
write, read-modify-write) suggest that this exponential
assumption may be as good an approximation as the
assumption that the memory cycle time is constant.
Without further assumptions or approximations, we can 
use the results of Jackson (7), and Gordon and Newell 
(6), to find the performance of the multiprocessor 
system. This technique is also used by McCredie (9) 
for multiprocessors with tp > tw. 


Let the number of service centers be m. The
states of the system are m-dimensional vectors with
non-negative integer components, the j-th component
representing the queue length at center j. If
K = (k_1, k_2, ..., k_m) is a state vector, then

    S(K) = \sum_{i=1}^{m} k_i .

Transition from one center to another is characterized
by a routing probability R_{ij}, i.e. the probability of
going to center j on completion of service at center
i. Jackson (7) has obtained the equilibrium joint
probability distribution of queue lengths for a broad
class of queueing-theoretical models representing a
network of service centers. Customer arrivals are
modeled as a generalized Poisson process whose mean
arrival rate varies almost arbitrarily with the total
number of customers already in the system. Service
completions at each center are also modeled as general-
ized Poisson processes, the mean service rate, \mu, at
each center varying arbitrarily with the queue length
there.


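For closed queueing systems, Jackson's formulae
reduce to

    P(K) = \frac{w'(K)}{T'(S(K))}

where

    w'(K) = \prod_{j=1}^{m} \prod_{i=1}^{k_j} \frac{e(j)}{\mu_j(i)} ,

\mu_j(i) being the mean service rate at center j when
its queue length is i,

    e(j) = \sum_{i=1}^{m} e(i) R_{ij} ,    j \in [1,m],

and

    T'(n) = \sum w'(K)    summed over all K with S(K) = n.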


But, with Pc requests distributed uniformly and with
the bus-bound situation, or tp=tw, Jackson's model
simplifies to m servers with customers circulating with
uniform routing probabilities, i.e. R_{ij} = P_{ij} = 1/m.
Using the above formulae, with the same constant service
rate \mu at every center (so that e(j) can be taken as 1
for all j), we get

    w'(K) = \mu^{-n} ,

    T'(n) = \binom{n+m-1}{m-1} \mu^{-n} ,

    P(K) = \binom{n+m-1}{m-1}^{-1}    for all K such that \sum_{i=1}^{m} k_i = n,

i.e. all the states of the system are equally likely.
Physically, this indicates that states with greater
congestion in the queues are as likely as evenly dis-
tributed queues. The probability that a particular Mp
module is idle, Pr{Mp[i] is idle}, is the fraction of
the total number of states that has k_i = 0. In other
words,

    Pr{Mp[i] is idle}
        = \frac{number of ways of assigning n Pc's to m-1 Mp's}
               {number of ways of assigning n Pc's to m Mp's}
        = \binom{n+m-2}{m-2} \Big/ \binom{n+m-1}{m-1}
        = \frac{m-1}{n+m-1} ,

so that

    E{number of busy Mp's} = \sum_{i=1}^{m} Pr{Mp[i] is busy}
                           = \frac{m n}{m+n-1} .


The above expression has a number of interesting
properties: the expression is symmetric in m and n; it
has a basic hyperbolic form, asymptotic to n as m gets
large; and, if we let m=n the above expression becomes
n/(2-1/n) and

    E{number of busy Mp's} \to n/2    as n \to \infty.

The final observation has important implications.
It states that as multiprocessor systems grow to include
more and more Pc's, we are not faced with a law of di-
minishing returns: no matter how many Pc's are used,
if we have the same number of memory modules we can
expect half the processors to be active.
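
The closed form above can be checked numerically. The following is a minimal sketch, not part of the original analysis: it assumes only the result that all states with n Pc's distributed over m queues are equally likely, enumerates those states by brute force, and averages the number of non-empty queues. The function names are illustrative.

```python
from itertools import product
from math import comb


def busy_continuous(n: int, m: int) -> float:
    """Closed form from the text: E{number of busy Mp's} = m*n/(m+n-1)."""
    return m * n / (m + n - 1)


def busy_by_enumeration(n: int, m: int) -> float:
    """Average number of non-empty queues over all equally likely states."""
    states = [k for k in product(range(n + 1), repeat=m) if sum(k) == n]
    # The number of states is the number of ways of placing n Pc's on m Mp's.
    assert len(states) == comb(n + m - 1, m - 1)
    busy = sum(sum(1 for queue in k if queue > 0) for k in states)
    return busy / len(states)


if __name__ == "__main__":
    for n, m in [(2, 2), (4, 4), (8, 3)]:
        print(n, m, busy_continuous(n, m), busy_by_enumeration(n, m))
```

The enumeration grows as (n+1)^m and is only meant for small systems; the closed form is what one would use in practice.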


4. A SIMPLE DISCRETE MARKOV CHAIN MODEL


For this analysis let us assume that all the Pc's
are characterized by a single constant processing time
tp. Also, all the memory units are assumed to have the
same cycle time tc and access time ta. Thus, the mem-
ory rewrite time is given by tw=tc-ta. If tp=tw then
all memory units can be considered to be operating
synchronously. Thus, during any memory cycle the num-
ber of active Pc's is equal to the number of busy Mp's.


In this section a simple Markov Chain analysis is
presented for the case in which the processors request
every memory with equal likelihood. The state of the
multiprocessor system is defined by an m-tuple
(k_1, k_2, ..., k_m) where

    \sum_{i=1}^{m} k_i = n    and    0 \le k_i \le n  for all i.

The number of distinct states of the system is given by
the combination

    \binom{n+m-1}{m-1} ,

i.e. the number of ways in which n balls can
be assigned to m bins (4). However, since all the pro-
cessors behave identically, a number of the distinct
states are equivalent, i.e. they have the same occu-
pancy and have the same components, e.g. states (2,1,1),
(1,2,1), (1,1,2) are equally likely. Thus, the re-
duced states are given by the different ways in which
the number n can be partitioned into m parts. The
number of such partitions (for n \le m) is asymptotic to

    \frac{e^{\pi \sqrt{2n/3}}}{4 n \sqrt{3}}    (cf. 2)
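
To give a feel for how much the reduction saves, the sketch below counts the distinct states, counts the reduced states with a standard recurrence for partitions into at most m parts, and evaluates the asymptotic expression. It is an illustration only; the function names are hypothetical and the recurrence is a textbook one rather than anything taken from the paper.

```python
from functools import lru_cache
from math import comb, exp, pi, sqrt


def num_distinct_states(n: int, m: int) -> int:
    """Ways of assigning n balls (Pc's) to m bins (Mp's)."""
    return comb(n + m - 1, m - 1)


@lru_cache(maxsize=None)
def num_reduced_states(n: int, m: int) -> int:
    """Number of partitions of n into at most m parts."""
    if n == 0:
        return 1
    if n < 0 or m == 0:
        return 0
    # Either fewer than m parts are used, or exactly m parts are used;
    # in the latter case removing 1 from each part leaves n - m.
    return num_reduced_states(n, m - 1) + num_reduced_states(n - m, m)


def partition_asymptote(n: int) -> float:
    """Asymptotic estimate exp(pi*sqrt(2n/3)) / (4*n*sqrt(3))."""
    return exp(pi * sqrt(2 * n / 3)) / (4 * n * sqrt(3))


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(n, num_distinct_states(n, n), num_reduced_states(n, n),
              round(partition_asymptote(n), 1))
```

For a 4 by 4 system this gives 35 distinct states against 5 reduced states, the five partitions used in the example later in this section.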


Let the representative state S_i denote the set of
compositions of the number n that yield the same par-
tition, e.g. the compositions (2,1,1), (1,2,1) and
(1,1,2) correspond to the partition of the number 4
which has two 1's and one 2. Further, let S_{i,k} be the
individual compositions of the partition typified by
representative state S_i and \hat{S}_i be that composition
which has its components arranged in monotonic non-in-
creasing order, i.e. (2,1,1) for the above example.


Let X_{ij} denote the probability of a transition
from S_j to S_i. Then, due to the symmetry of the
problem,

    X_{ij} = \sum_{S_{i,k} \in S_i} P{Transition from \hat{S}_j to S_{i,k}} .


Let the m-tuple (k_1, k_2, ..., k_m) denote the state
of the Markov chain. If x is the number of non-zero
elements in this vector then at the end of the memory
cycle, x new processors have to be reassigned to memory
modules. At the end of the current memory cycle the
queue is characterized by the m-tuple (j_1, j_2, ..., j_m)
where

    j_i = k_i - 1    if k_i > 0
    j_i = 0          if k_i = 0 .

A new state (l_1, l_2, ..., l_m) is reachable from
(k_1, k_2, ..., k_m) if and only if l_i \ge j_i for 1 \le i \le m. If the
above condition is satisfied the probability of the
state transition is given by

    \binom{x}{d_1, d_2, ..., d_m} \left(\frac{1}{m}\right)^{x}
        = \frac{x!}{d_1! \, d_2! \cdots d_m!} \left(\frac{1}{m}\right)^{x}

where d_i = l_i - j_i. Note that since

    \sum_{i=1}^{m} k_i = \sum_{i=1}^{m} l_i = n ,    \sum_{i=1}^{m} d_i = x .
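
As an illustration of the multinomial expression above, the following sketch evaluates the probability of a single composition-to-composition transition. It assumes, as in the text, that each released Pc picks one of the m modules with probability 1/m; the function name and argument layout are hypothetical.

```python
from math import factorial


def transition_probability(k, l):
    """Probability of moving from composition k to composition l in one cycle."""
    m = len(k)
    # Partial state: one Pc is removed from every non-empty queue.
    j = [ki - 1 if ki > 0 else 0 for ki in k]
    x = sum(1 for ki in k if ki > 0)          # Pc's to be reassigned
    d = [li - ji for li, ji in zip(l, j)]     # arrivals needed at each Mp
    if sum(l) != sum(k) or any(di < 0 for di in d):
        return 0.0                            # l is not reachable from k
    ways = factorial(x)
    for di in d:
        ways //= factorial(di)                # multinomial coefficient
    return ways / m ** x


if __name__ == "__main__":
    print(transition_probability((2, 2, 0, 0), (1, 1, 1, 1)))  # 2/16
    print(transition_probability((2, 2, 0, 0), (1, 1, 2, 0)))  # 1/16
```

Summing this quantity over all compositions belonging to one partition gives the reduced-state transition probability X_{ij}.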


Thus, we now have a formula for generating the 
transition probabilities. Due to the symmetry of the 
problem it suffices to generate only the transition 
probabilities for the representative class of states. 
All the different ways of obtaining the same partition 
are lumped together to form a reduced state. 


To illustrate a computational method for generat-
ing the transition probabilities consider an example
of a 4 by 4 system. The number 4 can be partitioned in
five different ways: {(4,0,0,0); (3,1,0,0); (2,2,0,0);
(2,1,1,0); (1,1,1,1)}.


These partitions represent five equivalence clas-
ses that characterize the state of the Markov Chain.
Let us consider the state (2,2,0,0). At the end of a
memory cycle, the resultant partial state is (1,1,0,0)
with two free processors to be reassigned. Figure 4.1
shows the different ways in which these two Pc's can
be assigned, one at a time, to reach a new partial
representative state. After both Pc's are assigned a
terminal state is reached. The number on the arrow
indicates the number of ways of reaching the partial
or terminal state that the arrow points to. Now the
number of ways in which a final state can be reached
from the initial state can be computed by traversing
the tree, e.g. there are 2x1 ways of reaching (1,1,1,1)
and (2x2 + 2x3) ways of reaching (2,1,1,0) from
(2,2,0,0).


FIGURE 4.1 
The use of a tree to generate the transition probabili-
ties was suggested by F. Baskett and D. Chewning of
Stanford University.


It is possible to construct a single tree with
different pointers for different initial states. Fig-
ure 4.2 shows a complete tree for a 4x4 system. Init-
ial states are circled. The entire transition matrix
can be generated by traversing this tree. A conveni-
ent way of traversing this tree is by using a stack
which has depth equal to one more than the number of
Pc's. At each level the stack contains a partial
state and has a pointer to the initial representative
state (if any) from which it is derived. The stack is
initialized to contain the path that leads to the top-
most final state. For this example the transition
matrix is shown in Figure 4.3.


FIGURE 4.2


Enumeration Tree for a 4 by 4 Multiprocessor System


FIGURE 4.3


Steps in the Generation of the Transition Matrix


STEP 1: X_{ij} is the number of ways of reaching i from j.


STEP 2: X_{ij} = X_{ij} / \sum_i X_{ij}
        (Note that \sum_i X_{ij} = m^x, where x of the
        components of j are non-zero)


Final equations to be solved simultaneously:

    | P4000 |   | 0.25  0.0625  0.000  0.015625  0.015625 | | P4000 |
    | P3100 |   | 0.75  0.3750  0.125  0.187500  0.187500 | | P3100 |
    | P2200 | = | 0.00  0.1875  0.125  0.140625  0.140625 | | P2200 |
    | P2110 |   | 0.00  0.3750  0.625  0.562500  0.562500 | | P2110 |
    | P1111 |   | 0.00  0.0000  0.125  0.093750  0.093750 | | P1111 |

SUBJECT TO

    P4000 + P3100 + P2200 + P2110 + P1111 = 1
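
The two steps of Figure 4.3 and the solution of the resulting equations can be sketched as follows. This is a minimal illustration under stated assumptions rather than the stack-based tree traversal described in the text: it simply enumerates all m^x reassignments of the released Pc's for each representative state, and it finds the equilibrium probabilities by repeated multiplication instead of solving the linear system directly. All function names are hypothetical.

```python
from collections import Counter
from itertools import product


def representative_states(n, m):
    """Partitions of n into at most m parts, zero-padded, largest part first."""
    def gen(remaining, largest, parts):
        if len(parts) == m:
            if remaining == 0:
                yield tuple(parts)
            return
        for p in range(min(remaining, largest), -1, -1):
            yield from gen(remaining - p, p, parts + [p])
    return list(gen(n, n, []))


def transition_counts(state, m):
    """STEP 1: number of ways of reaching each reduced state from `state`."""
    partial = [k - 1 if k > 0 else 0 for k in state]
    x = sum(1 for k in state if k > 0)
    counts = Counter()
    for choice in product(range(m), repeat=x):   # module picked by each free Pc
        new = partial[:]
        for module in choice:
            new[module] += 1
        counts[tuple(sorted(new, reverse=True))] += 1
    return counts, x


def expected_busy(n, m, iterations=500):
    """Equilibrium average number of busy Mp's for the discrete model."""
    states = representative_states(n, m)
    index = {s: i for i, s in enumerate(states)}
    size = len(states)
    X = [[0.0] * size for _ in range(size)]
    for j, s in enumerate(states):
        counts, x = transition_counts(s, m)
        for target, ways in counts.items():
            X[index[target]][j] = ways / m ** x  # STEP 2: divide by m^x
    p = [1.0 / size] * size
    for _ in range(iterations):                  # iterate p <- X p
        p = [sum(X[i][j] * p[j] for j in range(size)) for i in range(size)]
    return sum(pi * sum(1 for k in s if k > 0) for pi, s in zip(p, states))


if __name__ == "__main__":
    print(transition_counts((2, 2, 0, 0), 4)[0])
    print(round(expected_busy(4, 4), 4))
```

For the state (2,2,0,0) the counts are 2, 2, 10 and 2 out of 16, matching the third column of the matrix above. The brute-force enumeration is only practical for small n and m; the tree traversal and the theorems that follow are aimed at larger systems.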

The following theorems can be used to increase
the efficiency of the program that generates the trans-
ition probabilities.


Theorem 1. There is a one-to-one correspondence be-
tween a representative state and a partial state that
the representative state reduces to at the end of a
cycle.


Proof. Let (k_1, ..., k_m) be a representative state.
The partial state at the end of the cycle is given by
(j_1, j_2, ..., j_m) where

    j_i = k_i - 1    if k_i > 0
    j_i = 0          if k_i = 0 .

Since no two representative states are alike and
\sum_{i=1}^{m} k_i = n, it follows that the partial states are dis-
tinct. ∎


Theorem 2. A partial state at level L in the enumera-
tive tree of Figure 4.2 can correspond to a terminal
state with exactly n-L occupied Mp's.


Proof. Let J = (j_1, j_2, ..., j_m) be a partial state in
the tree depicted in Figure 4.2. Furthermore, let the
number of non-zero elements in the partial state be y
and let \sum_{i=1}^{m} j_i = n-x. Since one Pc is always removed from
a non-empty queue at the end of a cycle, J is a partial
state that can be reduced from a valid representative
state K = (k_1, k_2, ..., k_m), if and only if the number of
non-zero elements in K is x, and x \ge y. Note that x and
y are both less than or equal to min(m,n) and \sum_{i=1}^{m} k_i = n.
If x<y then there is no representative state K that
corresponds to the partial state J. If x \ge y, then the
representative state is obtained by adding y 1's to the
non-zero elements of J and replacing x-y zeros of J by
1. At level L, \sum_{i=1}^{m} j_i = L. Therefore, the number of
occupied Mp's in K is equal to n-L. ∎


Figure 4.4 shows the average number of busy Mp's
when n=m. The curve has an almost constant slope of
0.586 for n>4. Figures 4.5 and 4.6 show the effect of
adding a Pc and an Mp respectively on the average num-
ber of busy Mp's.

For an alternative method for traversing the tree see


FIGURE 4.4


Multiprocessor Systems with n=m

[Plot: average number of busy Mp's versus number of Pc's = number of Mp's, 1 to 16.]


5. APPROXIMATIONS


Strecker's Approximation. Strecker (13) has an approx-
imate closed form solution to the discrete Markov Chain
model presented here. His approach is equivalent to
removing the queued processors from all the memory mod-
ules at the end of a memory cycle and reassigning them.
Thus the state of the system is considered independent
of the state during the last cycle. If we use this
assumption the distribution of Pc's queued for an Mp
follows the binomial distribution:

    Pr{Y=i} = \binom{n}{i} \left(\frac{1}{m}\right)^{i} \left(1-\frac{1}{m}\right)^{n-i}

where Y is a random variable equal to the number of
Pc's queued for Mp[j] and p_{ij} = 1/m for all i and j. Thus,

    Pr{Mp[j] is busy} = 1 - Pr{Mp[j] is idle}
                      = 1 - \left(1-\frac{1}{m}\right)^{n} .

In other words, the occupancy of Mp[j] is 1-(1-1/m)^n, and

    E{no. of occupied Mp's} = \sum_{j=1}^{m} Pr{Mp[j] is busy}
                            = m \left[1 - \left(1-\frac{1}{m}\right)^{n}\right] .


FIGURE 4.5


The Effect of Adding a Pc

[Plot: average number of busy Mp's.]


Strecker's approximation overestimates the unit execu-
tion rate, but it is encouraging to note that such a
simple expression is within 6 to 8% of the exact solu-
tion of the Markov Chain model for m/n > 0.75. More-
over, the expression m*[1-(1-1/m)^n] can be written in an
exponential form as

    m * \{1 - \exp[n * \ln(1-\frac{1}{m})]\}

and the relaxation time, [-\ln(1-\frac{1}{m})]^{-1}, approaches m as
m gets large.
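
A short numerical sketch of Strecker's expression, its exponential form and the relaxation time follows. It evaluates only the formulas quoted above; the function names are illustrative, and the exponential form requires m of at least 2 because ln(1-1/m) is undefined for m=1.

```python
from math import exp, log


def strecker_busy(n: int, m: int) -> float:
    """Strecker's approximation: m * [1 - (1 - 1/m)^n]."""
    return m * (1 - (1 - 1 / m) ** n)


def strecker_busy_exponential(n: int, m: int) -> float:
    """Same quantity written as m * {1 - exp[n * ln(1 - 1/m)]} (m >= 2)."""
    return m * (1 - exp(n * log(1 - 1 / m)))


def relaxation_time(m: int) -> float:
    """[-ln(1 - 1/m)]^(-1), which approaches m as m gets large (m >= 2)."""
    return -1 / log(1 - 1 / m)


if __name__ == "__main__":
    for n, m in [(4, 4), (8, 8), (16, 16)]:
        print(n, m, strecker_busy(n, m), strecker_busy_exponential(n, m))
    for m in (2, 10, 100):
        print(m, relaxation_time(m))
```

For a 4 by 4 system the approximation gives 2.734375, a few percent above the exact discrete-model value listed for the same system in Table 2.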


Diffusion Approximations. An approximation method
that has been proposed for the solution of general
queueing networks is the diffusion approximation (cf.
8,11). A discrete-state process is approximated by a
diffusion process with a continuous path. The key
assumption in such an analysis is that incremental
changes in the queue lengths are normally distributed.
This leads to a characterization of the queueing net-
work by a set of diffusion equations. The accuracy of
the approximation depends on three factors: (i) ap-
proximation of a discrete-state process by a time-con-
tinuous Markov process, (ii) choice of proper reflect-
ing barriers, and (iii) discretization of the contin-
uous density function for queue lengths. Surprisingly,
for the simple discrete Markov Chain model of Section
4, the diffusion approximation yields a result identi-
cal to that with exponential servers derived from
Jackson's formulae. However, the main utility of the
diffusion approximation in this context is that it can
be used to analyze the effect of different coeffici-
ents of variation (ratio of standard deviation to the
mean) for the service time distribution.


FIGURE 4.6


The Effect of Adding an Mp

[Plot: average number of busy Mp's; n = 16.]


6. CONCLUDING REMARKS 


Table 1 summarizes the characteristics of various
models that have been discussed in this paper. With-
out a doubt the simplest model to use is the continu-
ous time Markov chain model: the average number of
busy Mp's, or the average number of busy Pc's, is
simply n*m/(n+m-1), where n is the number of Pc's and
m is the number of Mp's. In many cases, however, it
may be more realistic to model the memory cycle time
as constant, rather than exponentially distributed,
and hence we developed the discrete Markov chain model
in Section 4. Table 2 compares the continuous time and
discrete time Markov chain models. In practice, it has
proven useful to view these two models as bounds on the
performance that will be achieved by the actual system;
the continuous time Markov chain model is probably an
overestimate of the variance of memory cycle time while
the discrete Markov chain model is certainly an under-
estimate of the variance of the memory cycle time.


TABLE 1


Model             Processing    Memory Cycle   Analysis      Computational
                  Time          Time                         Ease

Discrete          Constant,     Constant       Exact         Solution is
Markov Chain      tp=tw                                      algorithmic.
                                                             Unwieldy for
                                                             large n.

Strecker's        Constant      Constant       Approximate   Closed form
Approximation                                                solution.
                                                             Simple formula.

Continuous Time   Exponential   Exponential    Exact         Closed form
Markov Chain                                                 solution.
                                                             Simple formula.

Diffusion         Constant      Constant       Approximate   Closed form
Approximation                                                solution.
                                                             Simple formula.

Simulation                                     Approximate   Unwieldy due to
Model                                                        slow stochastic
                                                             convergence.


TABLE 2


Expected number of busy memories in one cycle
Number of Pc's = 1,2,...,8 (rows)
Number of Mp's = 1,2,...,8 (columns)


Discrete Markov Chain Model

1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000
1.0000  1.5000  1.6667  1.7500  1.8000  1.8333  1.8571  1.8750
1.0000  1.6667  2.0476  2.2692  2.4095  2.5054  2.5748  2.6272
1.0000  1.7500  2.2701  2.6210  2.8630  3.0365  3.1657  3.2652
1.0000  1.8000  2.4102  2.8633  3.1996  3.4530  3.6482  3.8019
1.0000  1.8333  2.5059  3.0370  3.4533  3.7809  4.0415  4.2518
1.0000  1.8571  2.5751  3.1663  3.6486  4.0418  4.3636  4.6292
1.0000  1.8750  2.6274  3.2657  3.8024  4.2521  4.6294  4.9471


Continuous Time Markov Chain Model

1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000
1.0000  1.3333  1.5000  1.6000  1.6667  1.7143  1.7500  1.7778
1.0000  1.5000  1.8000  2.0000  2.1429  2.2500  2.3333  2.4000
1.0000  1.6000  2.0000  2.2857  2.5000  2.6667  2.8000  2.9091
1.0000  1.6667  2.1429  2.5000  2.7778  3.0000  3.1818  3.3333
1.0000  1.7143  2.2500  2.6667  3.0000  3.2727  3.5000  3.6923
1.0000  1.7500  2.3333  2.8000  3.1818  3.5000  3.7692  4.0000
1.0000  1.7778  2.4000  2.9091  3.3333  3.6923  4.0000  4.2667


Percentage Difference

0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000
0.0000  11.1133  10.0018   8.5714   7.4056   6.4910   5.7671   5.1840
0.0000  10.0018  12.0922  11.8632  11.0645  10.1940   9.3794   8.6480
0.0000   8.5714  11.8982  12.7928  12.6790  12.1785  11.5519  10.9059
0.0000   7.4056  11.0904  12.6882  13.1829  13.1190  12.7844  12.3254
0.0000   6.4910  10.2119  12.1930  13.1266  13.4412  13.3985  13.1591
0.0000   5.7671   9.3899  11.5687  12.7939  13.4049  13.6218  13.5920
0.0000   5.1840   8.6549  10.9196  12.3369  13.[...]


There are a couple of important considerations in 
the analysis of memory interference in multiprocessors 
that have not been touched on in this paper. The first 
is that many multiprocessors may not be bus bound, or
tp ≠ tw. For discussion of situations where tp is
greater or less than tc see [1,13]. Another aspect in
these models that needs to be examined more closely is 
the assumption that each processor accesses each memory 
module with equal probability. Program behavior, as 
well as the memory management policies of the operating 
system, may have a dramatic impact on these accessing 
probabilities. Measurement experiments are currently 
being designed for C.mmp to collect these processor to 
memory accessing frequencies. 
ABSTRACT


This paper describes the interconnection scheme 
devised for an advanced Air Force system concept 
called Distribution Processor/Memory (DP/M) in
which topologically irregular networks of small com- 
puters are used to perform avionics processing. The 
interconnection scheme involves the use of a combi- 
nation of global and point-to-point busses to handle 
message traffic in predominantly homogeneous sys- 
tems of from 5 to 20 computers. The major features 
of the scheme are the use of biphase bit-serial trans-
mission, associatively addressed messages, and a
method for reconfiguration of the point-to-point com- 
munications paths under program control. It is ex- 
pected that the scheme may have general applicability 
to other distributed processing systems, particularly 
other real-time systems employing limited-capability 
processors. 


INTRODUCTION 


The problems involved in interconnecting a multi- 
computer system, particularly when "multi" means
three or more, are well known. Tradeoffs in the 
design involve factors such as the cost of busses 
versus their speed, their complexity versus their load 
on the computational resources of the system, their 
reliability and its effect on system reliability, ad infi-
nitum. This paper presents a particular interconnec-
tion scheme* developed to fit a specialized environ- 
ment, but one which may have more general applica- 
bility in computer networks. This scheme involves 
the interconnection of processors by a single global 
bus together with a nonregular network of processor- 
to-processor links. These links are switchable to 
allow configuration of a variety of data paths during 
operation. The resulting paths are used both as a 
primary communications medium and as a backup for 
the global bus. An associatively addressed message 
transmission scheme for transfers on both the busses 
provides for intercommunications with little degrada- 
tion of computational capability, even for large (over 
20 processor) systems. 


PROBLEM BACKGROUND 


The DP/M concept is essentially the use of a varying 
number of simple and identical processor/memory
elements (PEs) to handle a wide range of avionics 
system-processing requirements. System sizes are 
expected to range from five to seven PEs on unde- 
manding missions to over 20 PEs in complex environ- 
ments. Each PE represents memory of 4K words 
and computation rate of about 250 thousand instruc- 
tions per second (KIPS) on avionics problems, so this 
means system capacities will range from 1000 to 5000 
KIPS. It is the job of the interconnection scheme to 
allow this level of modularity and the variability in 
system size by providing efficient communications 
between the components of the system without itself
becoming an undue consumer of processing resources,
a reliability handicap, or a costly resource in itself. 


The DP/M avionics processing load is partitioned into 
a number of relatively autonomous functions which 
communicate primarily via an "aircraft state vector" 
of a few hundred bits. These functions are further 
broken down into subfunctions with well-defined 
boundaries and low intercommunications require- 
ments. An example of a major function is flight con- 
trol, which may be separated by axis and by axis sub- 
functions into at least six units, called processes. As 
a test case during DP/M concept development, a very 
demanding environment was hypothesized and broken 
down into approximately 50 individual processes. 
Each process in the decomposition is of low com- 
plexity, with typical requirements of under 150 KIPS 
and 2K memory words. 


In such a decomposition, communication within the 
system is of two distinct types--interfunctional and 
intrafunctional. Including Exec overhead, the former 
is estimated at under 200 thousand bits per second, 
while the latter may be up to 300 Kbits per second.
The interfunctional transfers are typically short mes- 
sages such as Exec commands and state vector infor- 
mation, while the intrafunctional transfers tend to be 
longer, consisting of data block moves. Interfunc- 
tional transfers involve all processors at one time or 
another, while intrafunctional transfers are localized 
to the few processors in which the function is per- 
formed. 


Physical constraints on the interconnection scheme 
were quite limiting. From the beginning, it was 
determined that the system would be physically dis- 
tributable around the aircraft and that the intercon- 
nection scheme should thus allow this distribution with 
low cost. Also, the software goal was to have maxi- 
mum commonality between systems of different sizes, 
so the interconnection could not change character as 
system size varied. Finally, since the interconnection 
is the major "central" system resource, it had to be
amenable to fault tolerance techniques and provide a 
low-cost redundancy option. A simplifying assumption 
was that the system I/O to sensors, actuators, etc., 
would be handled directly by the processors and not 
through the inter-PE connections.


DESIGN APPROACH 


The design of the interconnection scheme proceeded 
simultaneously with the definition of the processing 
elements, the software and the requirements analysis. 
As such, it had ample time for iteration and consideration.
Integrated bussing/processing approaches like
the Holland machine (2) and the distributed processor 
of Burnett and Koczela (3,4) were rejected early in
the work because of software problems, leaving the 
bussing work to proceed almost independently of the 
PE definition. The approaches used by a number of
advanced architectures like the Navy AADC (5) and
the Burroughs D-machine (6) were considered. These
were uniformly rejected, however, when the system
bandwidth requirements became known. It was found
that, up until the present, design approaches had
largely been devoted to high-rate intercommunication
between computers via memory modules, either by
multiprocessing like the D-machine, or by partial
sharing of memory like the CDC 6500 and others.
[An exception to this is IBM's ASP configuration for
two computers (7).] The unique characteristics
of the avionics environment, however, obviated the
need for massive amounts of shared data and, indeed,
argued against shared memory approaches for fault
tolerance reasons (protection of data).


*The scheme is a result of work done by the Honeywell Systems and Research Division for Wright-Patterson
Air Force Base in the development of a Distributed Processor/Memory (DP/M) system to serve general
avionics processing needs in the late 1970s and early 1980s (1).


Another characteristic of the more general-purpose 
approaches was their regularity. In order to handle a 
variety of processing loads, these machines had pro- 
vided very regular interconnection schemes in which 
the access rights of a processor to other processors 
or to memory were largely independent of its location 
in the system. The Solomon (8) architecture is a 
good example of this. In contrast, the DP/M environ- 
ment involved a known and nonregular pattern of inter- 
communications between processes and a general 
level of global (interfunctional) communications. 
Furthermore, except under unusual conditions such as 
reconfiguration to mask failures, the association of 
processes to processors was static, so interprocessor 
communications could be considered irregular and 
quasistatic. These differences, combined with the 
low data rates, the requirement for physical distribu- 
tion, and the requirement for fault tolerance, indicated 
that a new approach to computer interconnection might 
best solve the specific problem to which DP/M was 
addressed. 


The bussing scheme chosen, shown in Illustration 1, 
is a hybrid, combining a global bus visiting each PE 
with a number of point-to-point busses between PEs 
in an irregular pattern. Both busses are bit-serial, 
biphase coded, with data transfer rates of 1 Mbit per second.
The global bus is provided for the interfunctional data 
transfers and the local busses for the intrafunctional 
transfers and as a backup to the global bus. A dis- 
tinctive feature of the scheme is that the local busses 
are switchable; each PE includes hardware by which, 
under program control, the busses attached to it may 
be connected to each other, to the PE itself, or may 
be idle. Illustration 2 shows examples of the use of
this capability. A possible physical interconnection
is shown in Illustration 2a. Here, the maximum number
of busses to any PE (exclusive of the global connection) is three.
A combination of switch settings which configures a
quasiglobal bus is shown in Illustration 2b. This is
an example of what might occur during recovery from
a failed global bus. In Illustration 2c, a combination
of switch settings is shown which configures two so-called
"affinity groups" of PEs which may communicate
independently of the global bus for intrafunctional
transfers.


ILLUSTRATION 1
DP/M Hybrid Bussing
[Diagram labels: Bus Interface; Point-to-Point (Switched) Busses]


ILLUSTRATION 2
Switchable Bussing Alternatives
(a) Physical Wiring
(b) "Global" Option
(c) "Affinity" Groups
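
The grouping effect of a set of switch settings can be sketched as a connectivity computation. The Python below is an illustrative sketch only: the PE numbering, the example wiring, and the `affinity_groups` helper are assumptions and not part of the DP/M design. It also simplifies the switch model by assuming every PE on a closed link is attached to the resulting path.

```python
def affinity_groups(num_pes, closed_links):
    """Group PEs joined by closed local links (hypothetical model).

    closed_links is a list of (pe_a, pe_b) pairs, one per local bus
    whose switches are currently set to carry traffic.
    """
    parent = list(range(num_pes))

    def find(x):
        # Follow parent pointers to the group representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in closed_links:
        parent[find(a)] = find(b)

    groups = {}
    for pe in range(num_pes):
        groups.setdefault(find(pe), []).append(pe)
    # A PE on its own has no configured path, so it is not a group.
    return sorted(g for g in groups.values() if len(g) > 1)


# Six PEs with two independent paths, as in the "affinity group" case.
example = affinity_groups(6, [(0, 1), (1, 2), (3, 4)])
```

Closing enough links to join every PE into one group corresponds to the quasiglobal configuration of Illustration 2b.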


The hybrid approach provides distinct advantages
over a single, possibly faster, global bus. First, as
system requirements grow and change, the option of 
nonregular point-to-point interconnection is expected 
to allow more cost-effective expansion by requiring 
only useful interconnections. Secondly, the extra 
bandwidth can be concentrated in physically localized 
areas of the system instead of requiring overall high 
bandwidth and, in fact, may result in a very high total 
data transfer rate achieved by simultaneous use of 
many slow paths. Finally, the two-type approach can
be used to provide redundancy for fault tolerance as, 
and when, needed rather than on an all-or-nothing 
basis. 


Provision of switchability in the local busses is pri- 
marily for fault recovery and flexibility reasons. In 
case of a processor or local bus failure, relocation of 
processes may be required, negating the effectiveness 
of a dedicated approach to interconnection. Using the 
switchability, however, an alternative switch pattern
can be used to provide the same intercommunication 
paths to the now relocated processes. Also, in case 
of a global failure, a quasiglobal bus can be configured 
to handle some or all of the previous global bus traffic. 
In this case, too, the system designer can choose to 
spare the global bus with a complete set of switchable 
busses or he can use some or all of the connections 
primarily intended for intrafunction traffic. As will 
be shown below, the bus hardware supports such 
reconfiguration to the extent that the reconfigured 
interconnection may be totally invisible to the software.


DETAILED DESIGN 
ADDRESSING MECHANISM 


In order to minimize the overhead involved in process 
relocation within the DP/M system, as well as to 
make the geometry of the system and of the intercon- 
nection scheme invisible to the software, it was 
determined early in the design that physical addressing
of messages on the communications system was undesirable.
Software that was transferred between
systems of various sizes as well as software operating 
before and after process relocation could not be easily 
provided with enough information to physically address 
its messages. In a system like DP/M, tables for such
addressing would have been difficult, if not impossible, 
to maintain during mission phase changes and recon- 
figuration after failure. As an alternative to physical 
addressing, it was decided to place in each PE's bus 
interface enough hardware to support associative 
addressing of messages and to require each transmission
on a bus to be preceded by a destination "name."
Each process in a PE is required to place in the
appropriate interface registers a "name" by which it
is known in the system. The bus interface then
has the responsibility of handling a list of these names
in associative memory fashion, matching message 
traffic on the bus against names and accepting mes- 
sages destined for processes within the PE. 


It was determined further that the names of processes
tended to be hierarchical in nature; that is, a process
might be identified as "Flight Control, Y Axis,
Stability Augmentation Loop," and that it was desirable
to allow messages to be directed either to a particular
component process by using its full identification or to
other levels of the naming "tree." To accomplish this
without requiring each process to specify multiple 
names, destination names transmitted by processes 
were made of variable length and the associative 
matching performed by the interface is on a bit-by-bit 
basis. Thus, the name specified by the process wish- 
ing to receive messages is its full identification, but 
it is given all messages whose specified destination 
matches the name in all transmitted bits. Note that 
this type of scheme allows both one-to-one and one-to- 
many type transmissions. 
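
The matching rule described above amounts to prefix matching on bit strings. The following Python is a minimal sketch under stated assumptions: names are modeled as strings of "0" and "1", and the PE labels, register contents, and the `accepts` and `receivers` helpers are hypothetical illustrations rather than the interface hardware itself.

```python
def accepts(registered_name, destination):
    """True if every transmitted destination bit matches the register.

    The process registers its full name; a shorter destination
    addresses every process whose name begins with those bits.
    """
    return registered_name.startswith(destination)


def receivers(name_registers, destination):
    """Return the PEs holding at least one matching name register."""
    return sorted(pe for pe, names in name_registers.items()
                  if any(accepts(name, destination) for name in names))


# Hypothetical name registers: two processes share the prefix "101".
registers = {
    "PE1": ["1011"],
    "PE2": ["1010"],
    "PE3": ["0110"],
}

one_to_one = receivers(registers, "1011")   # full name
one_to_many = receivers(registers, "101")   # a higher level of the tree
```

A destination longer than a registered name never matches, since the register cannot agree in all transmitted bits.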


TRANSMISSION SCHEME 


Although electrical design of the bus has not begun, 
preliminary work and the results of other work (9, 10) 
indicate that a biphase coding scheme is optimal for the 
low data rates and physical environment foreseen for 
DP/M. The message format on the busses, using a 
biphase coding, is shown in Illustration 3. The first 
bits of the message contain the destination name 
interspersed between 1s at even bit times. Following
the first zero at an even bit time, the remainder of the
transmission is message content, with no further "tag"
bits. In this way, the variable-length name is uniquely
delimited with minimum wasted bandwidth. To simplify
the hardware, names are restricted to be less
than or equal to the PE's word size, currently either
16 or 24 bits. Following the name, the data transmission
is to be an integral number of words. To be
compatible with a proposed Air Force multiplexing
standard (10), the bus clock rate will be 2 MHz,
yielding a 1 Mbit per second raw transfer rate.


ILLUSTRATION 3
Example Message Format
[Diagram labels: biphase waveform; derived information; ident tag bits; information bits; beginning of message]
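
The framing rule can be illustrated with a short Python sketch. It is a sketch under assumptions, not the bus format itself: bit times are taken to be numbered from zero so that each tag bit precedes its name bit, bits are modeled as characters, and the `encode` and `decode` helpers are hypothetical.

```python
def encode(name, data):
    """Frame a message: a 1 tag before each name bit, then a 0 tag."""
    header = "".join("1" + bit for bit in name)
    return header + "0" + data


def decode(stream):
    """Split a framed stream back into (name, data)."""
    name = []
    i = 0
    # A 1 at an even bit time means another name bit follows.
    while stream[i] == "1":
        name.append(stream[i + 1])
        i += 2
    # stream[i] is the terminating 0 tag; the rest is message content.
    return "".join(name), stream[i + 1:]


framed = encode("101", "11110000")
name, data = decode(framed)
# An n-bit name costs 2n + 1 bit times of header.
header_bits = len(framed) - len("11110000")
```

Under this reading a short name costs proportionally little header, which is the sense in which the variable-length name wastes minimal bandwidth.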


Busses are allocated on a round robin basis, with 
each PE on a bus being provided with opportunity to
transmit or "pass" in turn. Control passes from one 
PE to another when a PE in control transmits a bi- 
phase synch pulse (a pulse more than one bit-time in 
duration). Each PE has two registers in its bus inter- 
face, a Bus Length register and a Position register 
indicating its position on the bus. Whenever a synch 
pulse is transmitted on the bus, every PE increments 
a Current Control counter containing the bus position 
number of the PE which currently has control of the 
bus. In one PE, this number matches the Position 
register. This PE is in control of the bus, and has 
the option of transmitting a message or passing con- 
trol. To transmit a message, the PE simply begins 
emitting the biphase code as shown in Illustration 3, 
terminating the transmission (and its control of the
bus) with a synch pulse. If it has no transmission 
ready, it simply emits a synch pulse, causing control 
to pass on. Thus the minimum time between trans- 
missions from a PE is the time it takes for control to 
cycle around when every other PE on the bus emits 
only a synch pulse. This latency time is expected to 
be under 5 microseconds per PE, but is highly depen- 
dent on final electrical design, physical separation of 
PEs, etc. After overhead for allocation and destina- 
tion header transmission, the busses are expected to 
provide information transfer rates in excess of 500 
Kbits, a safety factor of more than 2:1 over anticipated 
requirements. 
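
The allocation rule can be sketched as a small simulation. The Python below is illustrative only: the class and function names are hypothetical, the counter is assumed to wrap modulo the Bus Length register (the text states only that it is incremented), and the latency helper simply applies the 5 microsecond per PE figure quoted above as an upper bound.

```python
class BusInterface:
    """Allocation state held by one PE (hypothetical model)."""

    def __init__(self, position, bus_length):
        self.position = position        # Position register
        self.bus_length = bus_length    # Bus Length register
        self.current_control = 0        # Current Control counter

    def on_synch(self):
        # Assumed: the counter wraps when it reaches the bus length.
        self.current_control = (self.current_control + 1) % self.bus_length

    def has_control(self):
        return self.current_control == self.position


def run_cycle(interfaces, pending):
    """Pass control once around the bus; return messages in send order."""
    sent = []
    for _ in range(len(interfaces)):
        owner = next(pe for pe in interfaces if pe.has_control())
        queue = pending.get(owner.position)
        if queue:
            sent.append((owner.position, queue.pop(0)))
        # The owner ends its turn with a synch pulse seen by every PE.
        for pe in interfaces:
            pe.on_synch()
    return sent


def idle_cycle_bound_us(num_pes, per_pe_us=5.0):
    """Upper bound on the wait while every other PE only passes."""
    return (num_pes - 1) * per_pe_us


bus = [BusInterface(p, 3) for p in range(3)]
order = run_cycle(bus, {1: ["m"], 2: ["a", "b"]})
```

For a 20-PE bus this bound gives 95 microseconds between successive transmissions from one PE when the rest of the bus is idle.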


BUS SWITCH DESIGN 


As part of the study work, a preliminary design for 
the bus switch and interfaces was performed. A block 
diagram of the switch is shown in Illustration 4. The 
PE interfaces to the global bus and to a number of 
local busses via receiver/drivers which resistively 
couple to a balanced pair. The tee in the global bus
is presumed to be external to the PE, while local
busses are expected to connect to only two PEs. Inside
the switch, the busses are separated into receive,
transmit, and transmit key signals, which then fan to 
a number of crosspoint switches, shown in the detail. 
The contents of switch control registers control the 
crosspoints to effect the switching as shown in Tables 
1, 2 and 3. 


The PE is provided with two blocks of essentially 
identical interface hardware, one for the global bus 
and one which may be switched onto any of the local 
busses. In normal operation, the crosspoint shared 
by the global bus and the global bus interface is closed, 
while other crosspoints are closed as required. 
Alternative crosspoints are provided, however, to 
allow reconfiguration such as in Illustration 2a. Note 
that by reconfiguring in this way, the software con- 
tinues to use the global communications facility in 
exactly the same way, with only the bits in the switch 
control registers and possibly the bus control regis- 
ters (in the bus interface) being altered. 


As can be seen from the tables, all combinations of
two and three local busses can be interconnected via
the crosspoints and buffers. As an example, to connect
local bus A to local bus B, crosspoints one and
two are closed, connecting A and B via a buffer. If,
in addition, the PE itself is to be attached to the bus
thus configured, crosspoint 3 is also closed.


ILLUSTRATION 4 
It is obvious that the existence of a bidirectional buffer
circuit is essential to the success of the scheme.
Several TTL designs of such a circuit have been per- 
formed and a small breadboard has been constructed. 
As design of the system progresses and more infor- 
mation on clocking techniques, etc., is available, a 
full-scale breadboard consisting of several switches 
and interconnecting busses is planned. 


PE INTERFACE HARDWARE 


Block diagrams of the interfaces between the proces- 
sor portion of the PE and the busses are shown in 
Illustrations 5 and 6. Both are provided with identical 
encoding/decoding hardware and name recognition cir- 
cuitry. It should be noted that the provision of four 
name registers and four local bus connection points 
was somewhat arbitrary. The requirements for both 
are expected to be determined more definitively for 
DP/M in later work. 


On the output side, both interfaces provide channel- 
type hardware which allows the software to simply 
specify the memory location of the message to be 
transmitted, the length of the transmission, and the 
number of bits in the destination name. The channel 
then gains control of the bus, transmits the requisite 
header using bits from the first memory word, then 
transmits the remaining memory words as data. 


On the input side, the global interface is provided with 
queueing storage to allow incoming messages to be 
accepted by the processor in FIFO order. Interrupts 
are provided to indicate receipt of a complete message 
and impending queue overflow. In order for the soft- 
ware to determine the destination of the message, the 
destination address as received on the bus is included 
at the beginning of the queued information. 


For the local bus interface, where longer transmis- 
sions are expected, an input channel is provided to 
place the arriving information in a software-specified 
main memory buffer area. The input channel supports
automatic double buffering of arriving information, 
allowing the processor a great deal of time before data 
is overwritten. Here, as well as in the other inter- 
face blocks, various status flags and interrupts have 
been provided to the processor. 


Fault detection in this preliminary design is performed 
in two ways. First, a PE which misses its control
slot on the bus may effectively block all further use of 
the bus by not propagating its synch signal. This is 
detected by a timer in each PE's allocation logic 
which interrupts the processor, indicating "bus assign
failure" if an excessively long period of silence is
observed on the bus. Secondly, missed bits during 
data transfers on the bus are detected on the incoming 
side by maintaining a modulo word-length count of the 
arriving data. If, at the end of the transmission, this 
count is nonzero, the processor is interrupted. This 
detects any missing bits in the data portion of a trans- 
mission only. Errors in the destination header por- 
tion of the transmission are expected to require soft- 
ware detection, since it is likely that errors will 
cause the message to be either missed by all PEs or 
to be accepted by the wrong one or ones. 
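
The missed-bit check can be sketched as a counter that wraps at the word length. This Python is a minimal sketch with hypothetical names, assuming a 16-bit word; it models only the data portion of a transmission, as the text specifies.

```python
class WordCountChecker:
    """Modulo word-length count of arriving data bits (hypothetical)."""

    def __init__(self, word_size=16):
        self.word_size = word_size
        self.count = 0

    def on_data_bit(self):
        self.count = (self.count + 1) % self.word_size

    def end_of_transmission(self):
        # A nonzero count means the data was not a whole number of words.
        fault = self.count != 0
        self.count = 0
        return fault


def receive(num_bits, word_size=16):
    """Feed num_bits data bits and report whether a fault is flagged."""
    checker = WordCountChecker(word_size)
    for _ in range(num_bits):
        checker.on_data_bit()
    return checker.end_of_transmission()


intact = receive(48)        # three full words arrive
one_bit_lost = receive(47)  # a single bit was missed
word_lost = receive(32)     # exactly one whole word was missed
```

The last case shows a limit of the technique: a loss of exactly one word length leaves the count at zero and is not flagged by this check.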


CONCLUSIONS 


Although the result of preliminary work, this inter- 
connection scheme offers several unique features 
which may be of general interest. Among them are: 


1) Use of variable-length, associative addressing
of inter-PE messages. As
systems grow in complexity by distributing
the computing function, this may become
a cost-effective way of relieving the
software of the addressing burden. It
is quite analogous to the use of symbolic
rather than absolute addresses in assembler
and HOL programming with the
extension to dynamic mapping of addresses
onto hardware.


2) Provision of flexibility and fault-tolerance 
through the use of switchable intercon- 
nections between processors. This tech- 
nique provides a decentralized switching 
system which can be dynamically adapted 
to the needs of a particular problem phase. 


3) The use of relatively complex hardware to 
reduce significantly the communications 
and control overhead conventionally found 
in multiprocessor and multicomputer sys- 
tems. 


Much validation work remains to be done on this 
preliminary design, but much of interest has already 
been accomplished. Work of an architectural nature 
remains to be done to assess the general applicability 
of this type of computer system, particularly for more 
demanding problems. In situations where functions
do not easily decompose with low intercommunications,
extensions of the concept to higher bandwidth busses
may be considered. Theoretical investigation into
the physical and virtual interconnection networks
possible from switchable busses may be of great value.
It already appears that heuristic approaches to determining
interconnection patterns and switch settings
may be difficult to develop. Certainly, the concept of
an adaptive system with some intelligence, rather than
just a system which chooses from interconnection
templates, is worth investigating. As low-cost mini-
and micro-computers become available, the potential
for cost-effective distributed systems appears to be
increasing, and with it the interconnection problems
of such systems.


REFERENCES 


1. Johnson, M.D., et al. All Semiconductor Distributed
Aerospace Processor/Memory Study. Final
Report, Volume 2, Air Force Avionics Laboratory,
AFAL TR-73-226.

2. Holland, John. "A Universal Computer Capable of
Executing an Arbitrary Number of Subprograms
Simultaneously." Proc. EJCC, pp. 108-113, 1959.

3. Koczela, L.J. Study of Spaceborne Multiprocessing.
Final Report, Phase 1, Volume 2, 15 April
1967, National Aeronautics and Space Administration
Electronics Research Center, No. C6-1476. 10/33.

4. Burnett, G.J., et al. "A Distributed Processing
System for General-Purpose Computing." Proc. FJCC,
pp. 757-768, 1967.

5. Thurber, K.J., et al. Master Executive Control
for the Advanced Avionic Digital Computer, Interim
Report, Volume 1: Summary, Honeywell No. Z29506-3018,
June 1972.

6. Davis, R.L., et al. "A Building Block Approach to
Multiprocessing," Proc. SJCC, pp. 343-349, 1970.

7. Lorin, H. Parallelism in Hardware and Software:
Real and Apparent Concurrency. Prentice-Hall,
pp. 166-176, 1972.

8. Slotnick, D., et al. "The Solomon Computer," Proc.
FJCC, pp. 97-107, 1962.

9. Barnes, B.P., et al. Application of Information
Transfer Techniques for Solving the Internal Communication
Requirements of an Advanced Manned Bomber.
AFAL TR-72-209.

10. Proposed Standard for Aircraft Multiplex Data
Bus, Air Force Avionics Laboratory, Wright-Patterson
AFB, Ohio, 2 March 1973.


ILLUSTRATION 5
Bus Interface (Switched Bus)
[Block diagram labels: RX, TX, TX KEY; bus allocation logic with LENGTH and POSITION registers; Manchester encode; Manchester decode; synch; data assembly; channel control with BASE and LENGTH; compare/match logic; name buffer; recognition control; load name length; data buffer memory; channel allocation; memory access]


ILLUSTRATION 6
Bus Interface (Global Bus)
[Block diagram labels: RX, TX, TX KEY; bus allocation logic with LENGTH and POSITION registers; Manchester encode; decode; synch; channel control with BASE and LENGTH; name buffer; recognition control; load name length; N-word RAM with read pointer; allocation control; memory access]

BANYAN NETWORKS FOR PARTITIONING 
MULTIPROCESSOR SYSTEMS 


L. Rodney Goke 
Texas Instruments 
Austin, Texas 
and 
G. J. Lipovski 
University of Florida 
Gainesville, Florida 


1. INTRODUCTION

Restructurable computing systems using multiple 
miniprocessors are currently of interest and promise 
advantages over large single processor time-shared 
systems for some applications (1-3). The modular 
nature of such systems can offer graceful degradation, 
improved availability, and expandability. Such sys- 
tems to date have generally contained a small number 
of processors and have used one or more switching 
structures based on a crossbar. 

It is now reasonable to expect that the low cost and 
high cost/performance of mass produced LSI micro- 
processors will make systems with much larger num- 
bers of processors practical (4). Modules of other 
resources, such as memory and I/O, might also be 
more numerous in such systems. 

The number of contacts, or switching devices, for 
a crossbar, however, increases with the square of the 
number of connections to it, making it prohibitively 
expensive for very large systems. Since the fanout of 
switching devices in a crossbar increases linearly 
with the connections to the structure, this too can be a 
problem in large systems, especially when expandability
is not to be limited. It is thus increasingly desirable
to find structures better suited than the crossbar
to partitioning large systems.

This paper describes a class of partitioning net- 
works, called banyans, whose cost function grows 
more slowly than that of the crossbar and whose fan- 
out requirements are independent of network size. 
Such networks can economically partition the re- 
sources of large modular systems into a wide variety 
of subsystems. Any possible partition can be realized 
by paralleling several networks or by multiplexing a 
single network in a manner to be described later. Re- 
sults will be given indicating that a cost/performance 
advantage over the crossbar can be obtained for large 
systems and that the crossbar can, in fact, be con- 
sidered a non-optimal special case of a banyan net- 
work. Inherent fail-soft capability and the existence 
of rapid control algorithms which can be largely per- 
formed by distributed logic within the network are also 
important attributes of banyans. 

This paper presents fundamental properties and 
preliminary simulation results of banyan partitioning 
networks. A more detailed treatment, including 
proofs of theoretical properties, is reserved for ref- 
erence (5). 


2. PARTITIONING 

The purpose of a partitioning network, as considered
here, is to partition the resource modules of a
system into disjoint subsystems by effectively providing
a separate bidirectional data path connecting the
resources in each subsystem. Once connected, the 
resources of a subsystem could communicate by time 
sharing this data path in a manner similar to that used 
in such systems as the PDP-11 (6) and the HP 3000 
(7). 


2.1 CROSSBAR NETWORKS 

The crossbar network shown in figure 1a is perhaps
the most straightforward partitioning structure.
For N resource modules, it contains ⌊N/2⌋ data busses,
the maximum number of nontrivial subsystems possible
at one time. A subsystem with only one resource
is trivial because it does not need the structure to
communicate with itself. This network requires
N⌊N/2⌋ bidirectional SPST switching devices,¹ each
of which is connected to N-1 identical devices by a
data bus.

Figure 1b is a graph representing the same structure.
This representation is similar to that used by
Benes (9) and uses vertices to represent data busses
or links, and edges to represent the switches connecting
them. Note that the crossbar is represented
by a bipartite graph with an edge connecting each bus
with every resource module. Graph representations
will be used with other structures later.


Figure 1. Crossbar Partitioning Network


¹ Bidirectional electronic switching devices suitable
for all networks in this paper are discussed in reference
(8).

More specialized crossbar structures have been 
used in a variety of multiprocessor systems (3,10-12). 


2.2 PERMUTATION NETWORKS 

It is possible to build a partitioning network from a
permutation network by supplying the external links
shown in figure 2. A permutation network can connect,
in pairs, a set of input terminals to a set of output
terminals of equal size so that any desired permutation
of inputs onto outputs can be realized. These connections
allow transmission in either direction when bidirectional
switches are used in the network. In the configuration
of figure 2, the network permutes the set of resource
modules onto itself, allowing connected subsystems to
correspond to the cycles of the permutation. By choosing
a permutation with the appropriate cycles, any desired
partition can be connected.


Figure 2. Permutation Network used as a Partitioning Network
[Diagram labels: permutation network; resource modules]
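
The correspondence between partitions and permutation cycles can be made concrete with a short Python sketch. The helper names and the module numbering are assumptions made for illustration; the sketch shows one way to pick a permutation whose cycles are the desired subsystems, and how to read the subsystems back from a permutation.

```python
def permutation_for(partition, n):
    """Build a permutation of 0..n-1 whose cycles are the given blocks.

    Modules not named in any block map to themselves (trivial cycles).
    """
    perm = list(range(n))
    for block in partition:
        for i, module in enumerate(block):
            perm[module] = block[(i + 1) % len(block)]
    return perm


def cycles(perm):
    """Return the cycles of a permutation, each starting at its
    smallest-numbered unvisited module."""
    seen = set()
    result = []
    for start in range(len(perm)):
        if start in seen:
            continue
        cycle = []
        node = start
        while node not in seen:
            seen.add(node)
            cycle.append(node)
            node = perm[node]
        result.append(cycle)
    return result


wanted = [[0, 2, 3], [1, 4]]
perm = permutation_for(wanted, 6)
subsystems = cycles(perm)
```

Here module 5 is left in a cycle of length one, which corresponds to a trivial subsystem.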

This result is theoretically significant because it
implies that an N-terminal partitioning network does
not need to contain any more contacts than an N-input,
N-output permutation network. It has been shown that
when N is a power of 2, such a permutation network
can be built with as few as 4(N log₂ N - N + 1) contacts
(13-15).
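
To give a feel for the two cost expressions quoted so far, the following Python sketch evaluates the crossbar count N⌊N/2⌋ and the permutation-network bound 4(N log₂ N - N + 1) for a few sizes. The function names are hypothetical, and the figures are simply the formulas evaluated, not measured hardware costs.

```python
def crossbar_contacts(n):
    """Switching devices in the crossbar: N * floor(N/2)."""
    return n * (n // 2)


def permutation_contacts(n):
    """Contact bound 4(N log2 N - N + 1), valid for N a power of 2."""
    k = n.bit_length() - 1
    if n < 2 or (1 << k) != n:
        raise ValueError("N must be a power of 2 and at least 2")
    return 4 * (n * k - n + 1)


# (N, crossbar contacts, permutation-network contacts)
table = [(n, crossbar_contacts(n), permutation_contacts(n))
         for n in (8, 32, 64, 1024)]
```

By these formulas the crossbar needs fewer contacts for small N, the two are nearly equal at N = 32, and the permutation bound is lower from N = 64 upward.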

The partitioning structure of figure 2 is of limited
practical value, however, because of excessive propagation
delay in large subsystems. A signal in the
data path connecting a subsystem with i resource modules
may have to propagate through the permutation
network as many as ⌊i/2⌋ times to reach its destination.
Each time through, it must propagate through
as many as (log₂ N - 1) contacts. Control of this structure
would also be relatively complex and could limit
restructuring speed.

3. BANYANS

A banyan network, named for the East Indian fig
tree of somewhat similar structure, is defined in
terms of its graph representation. The graph of a
banyan is a Hasse diagram of a partial ordering (16)
in which there is one and only one path from any base
to any apex. A base is defined as any vertex having
no arcs incident into it, an apex is any vertex with no
arcs incident out from it, and all other vertices are
called intermediates. When used as a partitioning
network, the bases are connected to resource modules,
while the apexes and intermediates are within
the network. Some examples of banyans are shown in
figure 3. We use a directed graph representation because
it is useful for specifying the structure and its
control algorithms, but the switches represented by
the edges are still bidirectional.
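
The definition can be checked mechanically on a small graph. The Python below is a sketch under assumptions: a graph is given as a dictionary mapping each vertex to the vertices its arcs lead to, and the `is_banyan` helper and the example vertex names are hypothetical. It tests for circuits and then counts paths from every base to every apex.

```python
def is_banyan(arcs):
    """True if the directed graph has no circuit and exactly one
    path from every base to every apex."""
    vertices = set(arcs)
    for targets in arcs.values():
        vertices.update(targets)
    succ = {v: list(arcs.get(v, [])) for v in vertices}
    indeg = {v: 0 for v in vertices}
    for v in vertices:
        for w in succ[v]:
            indeg[w] += 1
    bases = [v for v in vertices if indeg[v] == 0]
    apexes = [v for v in vertices if not succ[v]]

    # Topological ordering; a circuit leaves some vertex unprocessed.
    remaining = dict(indeg)
    order = []
    ready = list(bases)
    while ready:
        v = ready.pop()
        order.append(v)
        for w in succ[v]:
            remaining[w] -= 1
            if remaining[w] == 0:
                ready.append(w)
    if len(order) != len(vertices):
        return False

    # Count paths to each apex, working back from the apex.
    for apex in apexes:
        paths = {v: 0 for v in vertices}
        paths[apex] = 1
        for v in reversed(order):
            if v != apex:
                paths[v] = sum(paths[w] for w in succ[v])
        if any(paths[b] != 1 for b in bases):
            return False
    return True


# A two-level example with four bases and four apexes.
two_level = {
    "b0": ["m0", "m1"], "b1": ["m0", "m1"],
    "b2": ["m2", "m3"], "b3": ["m2", "m3"],
    "m0": ["a0", "a2"], "m1": ["a1", "a3"],
    "m2": ["a0", "a2"], "m3": ["a1", "a3"],
}
```

A crossbar, viewed as bases joined directly to apexes, passes the same check.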


Figure 3. Examples of Banyans 


A) IRREGULAR BANYAN B) L-LEVEL BANYAN


3.1 TREE-SHAPED CONNECTIONS IN A BANYAN 

In a banyan the data path established to connect the 
resource modules of any subsystem always forms a 
tree rooted at some apex. By definition there is a 
unique path from each base to each apex. A subsys- 
tem is connected by selecting an apex and then closing 
all switches along the path from each desired base to 
the selected apex. Since each path is unique, the re- 
sulting data path forms a tree rooted at the apex. Al- 
gorithms for locating eligible apexes and establishing 
the connections will be presented in section 3.3.
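
The connection rule just described can be sketched in Python. This is an illustration only and is not the control algorithm of section 3.3: the graph encoding, the example network, and the `path_to_apex` and `connect` helpers are assumptions, and the input graph is assumed to be a banyan so that the path found is the unique one.

```python
def path_to_apex(arcs, vertex, apex):
    """Return the arcs on a path from vertex to apex, or None."""
    if vertex == apex:
        return []
    for nxt in arcs.get(vertex, []):
        rest = path_to_apex(arcs, nxt, apex)
        if rest is not None:
            return [(vertex, nxt)] + rest
    return None


def connect(arcs, bases, apex):
    """Switches (arcs) to close so the given bases share one data path."""
    closed = set()
    for base in bases:
        closed.update(path_to_apex(arcs, base, apex))
    return closed


# Hypothetical two-level banyan with four bases and four apexes.
network = {
    "b0": ["m0", "m1"], "b1": ["m0", "m1"],
    "b2": ["m2", "m3"], "b3": ["m2", "m3"],
    "m0": ["a0", "a2"], "m1": ["a1", "a3"],
    "m2": ["a0", "a2"], "m3": ["a1", "a3"],
}

closed = connect(network, ["b0", "b1", "b2"], "a0")
touched = {v for arc in closed for v in arc}
```

In the example the closed arcs number one fewer than the vertices they touch, which is the defining count for a tree.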

Tree-shaped data paths are significant because 
they can afford low propagation delay with limited fan- 
out and because they lend themselves well to the in- 
clusion of priority hardware (17). Propagation delay 
and fanout will be discussed later. Priority hardware 
is desirable in any data path used as a time-shared 
bus in order to resolve conflicts when two or more re- 
sources request bus control simultaneously. Details 
of how priority hardware can be built into a banyan 
network can be found in reference (5). 


3.2 SYNTHESIZING LARGE BANYANS FROM SMALL 
ONES 

Large banyan networks can be synthesized recur- 
sively from smaller ones. Suppose that one has avail- 
able a number of small banyan networks, perhaps 
supplied by a manufacturer as a basic module, and one 
wishes to synthesize a larger network. This can be 
done as illustrated in figure 4a by connecting the 
apexes of some banyans to the bases of others. 

The interconnections of these banyans can be represented
by a graph, as illustrated in figure 4b. In this
graph, each vertex represents a banyan network. An 
arc from any vertex V1 to another vertex V2 means 
that one apex of banyan V1 is directly connected to one 
base of banyan V2. We assume that if there are any 
arcs incident into a vertex, then the corresponding 
banyan has exactly one base for each incident arc.
Similarly, the number of apexes equals the number of
arcs incident out from the corresponding vertex unless 
there are none. When there are no arcs incident into 
a vertex, the bases of the corresponding banyan be- 
come the bases of the synthesized network. Similarly, 
the apexes of the synthesized network are those of the 
component banyans with no arcs incident out. 


Figure 4. Banyan Synthesis: (a) synthesized network; (b) interconnection graph


Theorem 1: When banyan networks are interconnected 
as described above, the resulting network will be a 
banyan iff the graph of the interconnections is a banyan 
graph. 

Proof Sketch: There are three ways that a directed
graph can fail to be a banyan: one, if it contains a
circuit; two, if there is more than one path from some
base to some apex; or three, if there is no path from
some base to some apex. Since the component networks are
banyans, any of these conditions in the interconnection 
graph would cause the same condition to exist in the 
graph of the synthesized network, and vice versa. 

This theorem is important because once one or 
more banyan structures are known, these structures 
can be recursively expanded to arbitrarily large sizes. 
The SW structure, discussed later, is based on re- 
cursive expansion of the crossbar, one of the simplest 
banyan structures. 


3.3 CONTROL OF CONNECTIONS 


Figure 5 illustrates how a data path connecting an
arbitrarily selected apex with any desired subset of
bases can be established in two steps. Set-up is
facilitated by a single control line provided in each
link of the network. First, a "one" signal is broadcast
baseward from the selected apex over the control line,
as illustrated in figure 5a. The signal fans baseward
at each vertex so that the "one" propagates to all
bases. This signal sets a flip-flop in each intermediate
and apex through which it passes.


Figure 5. Set-up Algorithm: (a) step 1, showing the selected apex; (b) step 2; (c) final connection

In the second step, "ones" are broadcast apexward
from each base in the desired subsystem, as illustrated
in figure 5b. In this step, the signal is OR'ed
apexward at each vertex. As illustrated in figure 5c,
the desired connection is made by closing every switch
that receives this signal from below and has a set
flip-flop in the adjacent vertex above. These are the
links through which control signals propagated in
steps one and two.


As described, this set-up algorithm would require
two steps but only one control line in each link.
Unlike the data lines, this control line is always
connected between vertices and does not require a
bidirectional switch for each edge of the graph.
Switching for the control line occurs at the vertices
where the signal is either OR'ed up or OR'ed down.
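
The two-step set-up can be mimicked in software. The sketch below is only a functional model under stated assumptions: the names reach and set_up and the example graph are hypothetical, and signal propagation is modelled as graph reachability rather than as hardware timing. Step 1 marks every vertex reached baseward from the apex (the set flip-flops); step 2 marks every vertex reached apexward from the chosen bases; a switch closes when its lower end carries the step-2 signal and its upper end has a set flip-flop.

```python
from collections import defaultdict


def reach(start, neighbours):
    """All vertices reachable from `start` by following `neighbours`."""
    seen = set(start)
    stack = list(start)
    while stack:
        v = stack.pop()
        for w in neighbours[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def set_up(arcs, bases, apex):
    succ, pred = defaultdict(list), defaultdict(list)
    for u, v in arcs:
        succ[u].append(v)
        pred[v].append(u)
    # Step 1: a "one" fans baseward from the apex and sets flip-flops.
    # (Bases are marked too; this is harmless because no arc ends at a base.)
    flip_flop = reach([apex], pred)
    # Step 2: "ones" from the desired bases are OR'ed apexward.
    signal = reach(bases, succ)
    # Close each switch with signal below and a set flip-flop above.
    return {(u, v) for u, v in arcs if u in signal and v in flip_flop}


# Hypothetical example banyan (not one of the paper's figures).
EXAMPLE = [("b0", "i0"), ("b0", "i1"), ("b1", "i0"), ("b1", "i1"),
           ("b2", "i2"), ("b2", "i3"), ("b3", "i2"), ("b3", "i3"),
           ("i0", "a0"), ("i0", "a1"), ("i2", "a0"), ("i2", "a1"),
           ("i1", "a2"), ("i1", "a3"), ("i3", "a2"), ("i3", "a3")]

print(sorted(set_up(EXAMPLE, {"b0", "b3"}, "a2")))
```

In this model the closed switches are exactly those on the unique paths from the chosen bases to the apex, since any vertex that both receives the step-2 signal and feeds a flagged vertex must itself lie below the apex.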


Any apex may be used in connecting the first subsystem,
but subsequent apexes must be selected so that the new
connection does not overlap with any vertex already in
use. A two-step search algorithm for
identifying the eligible apexes is illustrated in figure 
6. In this example, the circled vertices represent 
those already in use, and bases 3 and 6 are to be con- 
nected as a new subsystem. As shown in figure 6a, 
control signals are first broadcast apexward simul- 
taneously from all bases in the desired subsystem and 
are then OR'ed upward using the same control line 
used in set-up. During this step, a flip-flop is set in 
every intermediate and apex which receives this con- 
trol signal and is already in use. 

In the second step, illustrated in figure 6b, the control
signals from the bases are turned off, and each vertex
with a set flip-flop broadcasts a "one", which is OR'ed
apexward on the same line used in step one. All apexes
not receiving a "one" during this step are eligible.
Final selection could then be performed by a priority
circuit attached to the apexes.


Figure 6. Search Algorithm: (a) step 1; (b) step 2

Steps one and two of this algorithm, like those of 
the set-up algorithm, could be combined using a sec- 
ond control line. With four control lines, search and 
set-up could all be combined in one step. 
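
The search for eligible apexes can be modelled the same way. In this hedged sketch the names reach and eligible_apexes and the example graph are assumptions for illustration; propagation on the control line is again treated as plain reachability.

```python
from collections import defaultdict


def reach(start, neighbours):
    """All vertices reachable from `start` by following `neighbours`."""
    seen = set(start)
    stack = list(start)
    while stack:
        v = stack.pop()
        for w in neighbours[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def eligible_apexes(arcs, in_use, bases):
    succ = defaultdict(list)
    for u, v in arcs:
        succ[u].append(v)
    apexes = {v for _, v in arcs} - {u for u, _ in arcs}
    # Step 1: signals from the new subsystem's bases are OR'ed apexward;
    # a flip-flop is set in every vertex that receives one and is in use.
    flagged = reach(bases, succ) & set(in_use)
    # Step 2: flagged vertices broadcast apexward; apexes that stay quiet
    # are eligible.
    blocked = reach(flagged, succ)
    return apexes - blocked


# Hypothetical example banyan (not one of the paper's figures).
EXAMPLE = [("b0", "i0"), ("b0", "i1"), ("b1", "i0"), ("b1", "i1"),
           ("b2", "i2"), ("b2", "i3"), ("b3", "i2"), ("b3", "i3"),
           ("i0", "a0"), ("i0", "a1"), ("i2", "a0"), ("i2", "a1"),
           ("i1", "a2"), ("i1", "a3"), ("i3", "a2"), ("i3", "a3")]

# Bases b0 and b3 are already connected at apex a2 through i1 and i3.
busy = {"b0", "b3", "i1", "i3", "a2"}
print(sorted(eligible_apexes(EXAMPLE, busy, {"b1", "b2"})))   # ['a0', 'a1']
```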

In the event of a hardware failure, any vertex 
could be effectively removed from the network by dis- 
connecting all data lines to it and by treating it as if it 
were always in use. New connections would then be 
routed around the faulty cell.²

3.4 PARALLEL AND MULTIPLEXED NETWORKS 

In partitioning a system, the search and set-up al- 
gorithms are repeated until all subsystems have been 
connected or until no eligible apex can be found. Al- 
though most subsystems might be connected this way in 
practice, a banyan may not always be able to connect 
all subsystems of a partition simultaneously. When 
subsystems are associated with independent jobs, this 
would imply only that the partitioning network be a 
limited resource for which jobs must compete much as 
they do for other system resources. When a subsys- 
tem cannot be connected under existing conditions, the 
associated job could be held in a queue until enough 
other subsystems were dissolved to permit the connec- 
tion. 

If, however, one wishes to simultaneously connect 
more subsystems than can be accommodated with a 
single banyan, there are two solutions.³ First, several
banyans can be connected in parallel. The parallel
networks would function independently but their bases 
would be connected to the same set of resource mod- 
ules. As many subsystems as possible would be con- 
nected in the first network. Those left over would be 
connected in as many additional networks as required. 


The other solution is to multiplex a single network 
so that it periodically rearranges itself to connect first 
one set of subsystems, then another, and so on, so that 
each subsystem has some time slot during which it can 
communicate. A partitioning network, as considered 
here, acts as a rearrangeable set of time-shared 
buses. A resource module attached to the network 
must request and receive control of its bus before 
transmitting data, and must be prepared to wait when- 
ever the bus is not immediately available. Normally 
the bus would be unavailable only when used by other 
resources in the same subsystem; but should it ever 
become temporarily unavailable for other reasons, the 
only effect would be to delay data transmission within 
the subsystem. This situation makes multiplexing possible
with little or no modification of the resource modules.
The system need only be designed so that any resource not
currently connected by the network would "see" it as a
busy bus.


² This would still require a portion of the control circuitry
in a faulty cell to function. A slower search algorithm that
avoids this problem has been described by Lipovski (17).
Alternatively, a software search algorithm could replace the
faster hardware algorithm in the event of hardware failure.

³ In some cases it may also be possible to connect additional
subsystems in a single banyan by rearranging the connections
of existing subsystems, but this is only a partial solution
and will not be considered further here.

Multiplexing requires that a small amount of mem- 
ory be associated with each switch in the network to 
store the state of the switch during each time slot. 
With LSI this could be done at reasonable cost by asso- 
ciating a small register with each switch and synchro- 
nizing all state changes from a central clock. 

The techniques of parallel networks and multiplexing 
may be mixed to balance cost and performance. 
Whether a network structure is space shared with par- 
allel hardware or time shared with multiplexing, the 
parallel networks and/or time slots share many prop- 
erties and are called layers. The number of layers 
required depends on a number of factors and will be 
discussed later. 


4. L-LEVEL BANYANS 

Next we consider a class of banyans with more reg- 
ular structure and additional useful properties, but 
which is still general enough to include most practical 
designs. 

An L-level banyan is simply a banyan whose ver- 
tices are arranged in levels so that switches, or arcs 
of the graph, can only exist between vertices in adja- 
cent levels. For example, the graphs in Figures 3b, 
5, and 7 are L-level banyans, but 3a is not. There are
actually L+1 levels of vertices in an L-level banyan. 
They are numbered apexward from 0 to L so that all 
bases are in level 0 and all apexes are in level L. 

Any path from a base to an apex in an L-level ban- 
yan has exactly L arcs; thus the propagation time 
through the network during search and set-up is con- 
stant. Moreover, the propagation delay of data 
through the network cannot exceed that of 2L switches, 
since in the worst case, data must travel from base to 
apex to base. 


4.1 BASE AND APEX DISTANCE 

A base distance function, B1 Δ B2, can be defined
on the bases of any L-level banyan specifying the minimum
number of levels up into the banyan a connection
must extend to connect two bases, B1 and B2. Similarly,
an apex distance function, A1 ∇ A2, can be defined on the
apexes specifying the minimum number of levels down 
from the top of the structure a connection must extend 
to connect any pair of apexes. Figure 7 illustrates the 
concepts of base and apex distance. The darkened 
paths represent minimal connections. The connection 
of apexes is presented only as a conceptual aid in explaining
apex distance and would not actually occur in a
partitioning network.


Figure 7. Base and Apex Distance in an L-Level Banyan


The definitions of base and apex distance can be extended
to sets of bases and apexes respectively in the
same way that point distances are often extended to
sets of points. That is, the base distance between any
two sets of bases B1 and B2 is defined to be the minimum
of all distances b1 Δ b2 such that b1 ∈ B1 and
b2 ∈ B2. The analogous extension applies to apex distance.

Theorem 2: In an L-level banyan, let A1 and A2 be
apexes and let B1 and B2 be sets of bases. If
L < (B1 Δ B2) + (A1 ∇ A2), then subsystems B1 and B2
can be connected without conflict in the same layer
with connections rooted at A1 and A2 respectively.

Proof Sketch: In order for the tree-shaped connection
connecting subsystem B1 with apex A1 to conflict with
that connecting B2 with A2, the two connections must
have in common some vertex, V. But V must lie in
some level I such that B1 Δ B2 ≤ I ≤ L - (A1 ∇ A2).
No such I can exist if L < (A1 ∇ A2) + (B1 Δ B2).

Theorem 2 not only characterizes a way to avoid
conflicts, but also suggests ways to enhance network
performance. There are two potentially useful interpretations.
First, subsystems close to each other
place more stringent requirements on the separation
of apexes used than do widely separated subsystems,
suggesting that closely spaced subsystems are less
likely to be connected in the same layer. Thus, if it is
known at design time which resources of a system are
most likely to be connected, one might improve performance
by gerrymandering the assignment of resources
to bases so that bases most likely to be connected
tend to be closest. An operating system could
also take advantage of this result by allocating closely
spaced resource modules to a subsystem whenever
possible. The amount of improvement thus obtainable
is not estimated here since this would be highly problem
dependent, but one can easily contrive extreme
examples in which more than one layer would seldom
or never be needed.

The second interpretation concerns the selection of
apexes. The search procedure described earlier locates
all apexes eligible for connecting a new subsystem
[...]. Theorem 2 now [...] selection criterion.
According to the theorem, any new subsystem
can be connected if we can find some apex sufficiently
distant from those already in use. Thus apexes most
distant from those in use are the most valuable in the
sense that they are likely to be eligible for connecting
the greatest variety of subsystems. More subsystems
might then be connected in a layer by selecting each
new eligible apex so as to leave as many "valuable"
apexes as possible for subsequent connections. This
criterion is ambiguous in some cases, but nevertheless
is the conceptual basis for a priority rule found to
improve performance in simulated networks (5).


4.2 FANOUT AND SPREAD

Parameters specifying the number of arcs incident
into and out from the vertices of an L-level banyan not
only determine the fanout and fanin requirements of
circuits used but can also specify its size and shape.

We define a regular banyan to be an L-level banyan
in which the number of arcs incident into each vertex
is a constant F called the fanout and the number incident
out from each vertex is a constant S called the
spread. We except, of course, the fact that bases
have no arcs incident into them and apexes have none
incident out.

Regular banyans would likely be the most economical
to fabricate, because they can be built from a
number of identical cells, each containing the circuitry
associated with a vertex and the arcs incident
into it. The fanout and fanin requirements of these
cells are determined by F and S. The next theorem
shows how the number of vertices, and hence cells, in
each level of a regular banyan is determined by F, S,
and L, regardless of how the levels are interconnected.

Theorem 3: In a regular banyan with L levels, fanout
F, and spread S, the number of vertices in any level i
is given by N_i = S^i F^{L-i}.

Proof Sketch: For any given apex, there are F^L possible
paths from various bases. Since there must be
exactly one path from each base, N_0 = F^L. Also, for
each 1 ≤ i ≤ L, N_i = N_{i-1} (S/F). Therefore,
N_i = F^L (S/F)^i = S^i F^{L-i}.

When the fanout of a regular banyan equals its
spread, the number of vertices becomes the same in
each level. In this case we call it rectangular.


5. SPECIFIC BANYAN STRUCTURES

There are two known types of regular banyans of
particular interest, SW and CC banyans.⁴ Special
cases of these structures have been considered previously
for a variety of applications.

A structure graphically equivalent to a CC banyan
with L = 3 and F = S = 4 has been used in the
"Barrel Switch" of the ILLIAC IV Processing Element
(19) to shift 64 bits an arbitrary number of
places to the left or right.

SW structures were first proposed for partitioning
applications by Lipovski (18). Structures graphically
equivalent to rectangular banyans with F = 2
had been proposed earlier by Batcher for use as
"bitonic sorters" (20). A variety of permutation
structures have also been proposed which contain
special cases of SW banyans as subgraphs
(13-15, 21). Even such common structures as
crossbars and homogeneous trees are special cases
of banyans. [...] banyans as partitioning networks,
but this diversity of applications suggests that
banyan theory may be useful in other areas as well.

⁴ The term SW has been used in earlier work by
Lipovski (18). CC is an acronym for Cylindrical
Crosshatch since a CC network can be neatly laid out
as a crosshatch pattern on the surface of a cylinder.


5.1 SW STRUCTURES

The SW structure is a kind of regular banyan
produced by recursively expanding a crossbar
structure in the manner of Theorem 1 as illustrated
in Figure 8. Examples of SW banyans appear in
Figures 4, 5, and 6. The rules for this recursion
are as follows:

1) A one-level SW structure with fanout F and
spread S is simply a crossbar with F bases and S
apexes.

2) An L-level SW structure with fanout F and
spread S can be synthesized by interconnecting S^{L-1}
crossbars and F identical (L-1)-level SW structures,
all with fanout F and spread S. The apexes
of the SW structures are connected to the bases of
the crossbars such that the interconnection graph is
a crossbar. Also we stipulate that each crossbar
must be connected to every component SW structure
in the same way; i.e., if it is connected to the ith
apex of one SW, it must be connected to the ith apex
of each of the others. The reason for this stipulation
will be explained shortly.
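
The two recursion rules can be tried out directly. The sketch below is an illustration under assumptions made here: vertices are labelled (level, index), the function names are hypothetical, and the numbering of vertices within a level is one convenient choice among many. It builds an SW banyan recursively, then checks the level sizes against N_i = S^i F^(L-i) and the one-path-per-pair property.

```python
from collections import defaultdict


def build_sw(L, F, S):
    """Arcs of an L-level SW banyan; vertex (i, n) is vertex n of level i."""
    if L == 1:
        # Rule 1: a crossbar with F bases and S apexes.
        return [((0, b), (1, a)) for b in range(F) for a in range(S)]
    sub = build_sw(L - 1, F, S)

    def width(i):
        # Number of level-i vertices in one (L-1)-level component.
        return S**i * F**(L - 1 - i)

    arcs = []
    for c in range(F):                  # F copies of the (L-1)-level SW
        for (i, u), (j, v) in sub:
            arcs.append(((i, u + c * width(i)), (j, v + c * width(j))))
    for x in range(S**(L - 1)):         # S^(L-1) crossbars on top
        for c in range(F):              # crossbar x uses apex x of every copy
            for m in range(S):
                arcs.append(((L - 1, x + c * S**(L - 1)), (L, x * S + m)))
    return arcs


def level_sizes(arcs, L):
    levels = defaultdict(set)
    for (i, u), (j, v) in arcs:
        levels[i].add(u)
        levels[j].add(v)
    return [len(levels[i]) for i in range(L + 1)]


def path_counts(arcs, L):
    """Set of path counts over all (base, apex) pairs; {1} for a banyan."""
    succ = defaultdict(list)
    for u, v in arcs:
        succ[u].append(v)
    bases = {u for u, _ in arcs if u[0] == 0}
    n_apexes = len({v for _, v in arcs if v[0] == L})
    seen = set()
    for b in bases:
        counts = {b: 1}
        for _ in range(L):
            nxt = defaultdict(int)
            for v, n in counts.items():
                for w in succ[v]:
                    nxt[w] += n
            counts = nxt
        seen.update(counts.values())
        if len(counts) < n_apexes:
            seen.add(0)                 # some apex is unreachable from b
    return seen


arcs = build_sw(3, 2, 3)
print(level_sizes(arcs, 3))   # [8, 12, 18, 27]
print(path_counts(arcs, 3))   # {1}
```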

Figure 8. Synthesis of an SW Banyan


The base and apex distance functions of an SW
banyan tend to group the bases and apexes respectively
into nested subsets. It is apparent in Figure
8 that the bases of a synthesized SW banyan may be 
grouped according to the component SW's above 
them, forming a partition with F subsets. It is also 
apparent that any connection between bases in dif- 
ferent subsets must be made through one of the 
crossbars and hence must extend exactly L levels 
into the network. The distance between two such 
bases is thus L. 

Bases within a subset can always be connected 
in the component SW above; so when two bases are 
in the same subset, the distance between them cannot
exceed the levels of that component banyan, L-1.
To determine whether this distance is equal to L-1 
or less than L-1, one can similarly decompose the 
component SW's and partition each subset into F 
smaller subsets. Continuing this decomposition, 
one can obtain L-1 levels of nested subsets such 
that the distance between any two bases is given by 
the level of the smallest subset containing both 
bases.
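
The nested-subset description translates into a short computation. The sketch below assumes, purely for illustration, that bases are numbered 0, 1, 2, ... so that each nested subset is a contiguous block of F^level bases; the paper does not fix a numbering, and the function name is hypothetical.

```python
def sw_base_distance(b1, b2, F):
    """Level of the smallest nested subset (block of F**level bases) holding both."""
    level = 0
    while b1 != b2:
        b1 //= F      # move up to the enclosing subset
        b2 //= F
        level += 1
    return level


# With F = 2 and L = 3 there are 8 bases in blocks of 2, 4 and 8.
print(sw_base_distance(4, 5, 2))   # 1: same block of 2
print(sw_base_distance(0, 2, 2))   # 2: same block of 4
print(sw_base_distance(3, 4, 2))   # 3: only the whole network holds both
```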

Similarly, it can be shown that the apex distance 
function groups apexes into L-1 levels of nested
subsets such that each subset is divided into S 
smaller ones. 

As was stated in section 4.1, the base and apex 
distance functions specify the minimum number of 
levels into the structure that a connection must ex- 
tend to connect two bases or apexes respectively. 
In an SW banyan, these functions also specify the 
maximum number of levels into the structure that 
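branching may exist in any such tree-shaped connection.
The stipulation "each crossbar must be connected to every
component SW structure in the same way" is included to
insure this property for apex distance.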


A consequence of this property is that the con- 
verse of Theorem 2 also becomes true making it an 
if and only if test for conflicts. Thus for the SW 
structure, the criterion of Theorem 2 not only gives 
us a way to avoid conflicts but also a characteriza- 
tion of which apexes and bases can and cannot be 
connected without conflict in a single layer. 
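
As a hedged illustration of the criterion of Theorem 2 used as a conflict test for SW banyans, the sketch below assumes that bases and apexes are numbered so that nested subsets are contiguous blocks (of F^level bases and S^level apexes). That numbering, and the names nested_distance and no_conflict, are assumptions made here rather than conventions from the paper.

```python
def nested_distance(x, y, radix):
    """Level of the smallest contiguous block of radix**level holding x and y."""
    level = 0
    while x != y:
        x //= radix
        y //= radix
        level += 1
    return level


def no_conflict(L, F, S, bases1, apex1, bases2, apex2):
    """Theorem 2 criterion: L < (base distance) + (apex distance)."""
    base_d = min(nested_distance(b1, b2, F) for b1 in bases1 for b2 in bases2)
    apex_d = nested_distance(apex1, apex2, S)
    return L < base_d + apex_d


# 2-level SW banyan with F = S = 2; bases {0, 3} already use apex 2.
for apex in range(4):
    print(apex, no_conflict(2, 2, 2, {0, 3}, 2, {1, 2}, apex))
```

In this example apexes 0 and 1 pass the test for the new subsystem {1, 2}, while apexes 2 and 3 fail it.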


5.2 CC STRUCTURES 
The CC structure is rectangular by definition
and thus must have S^L vertices in each level. Let

    V_k^0, V_k^1, ..., V_k^{N-1}

be the vertices in each level k
of an L-level CC structure, where N = S^L. In the
graph of this structure, there is an arc from a vertex
V_k^i to a vertex V_{k+1}^j in the level above whenever
j = i + m S^k (mod N) for some m = 0, 1, ..., S-1. An
example of a CC structure is shown in Figure 7.

To show that this structure is indeed a banyan, 
we note first that the L-level property insures that 
the graph contains no loops and hence is that of a 
partial ordering. To show that there is exactly one 
path from each base to each apex, consider any 
such path from an arbitrary base. In propagating
from each level k to level k+1, a signal may be
shifted 0, S^k, 2S^k, ..., or (S-1)S^k places to the right
in circular fashion. In propagating through the entire
network a signal may then be circularly shifted
from 0 to S^L - 1 places to the right so that there is a
possible path to each of the S^L apexes. Further,
since there is an S-way branch at each level, there
are exactly S^L such paths from each base, and
hence one to each apex.
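
The argument above can be confirmed numerically for small cases. The following sketch is illustrative only (the function names are assumptions): it generates the arcs from the rule j = i + m S^k (mod N) and counts paths from every base to every apex.

```python
from collections import defaultdict


def cc_arcs(L, S):
    """Arcs ((k, i), (k+1, j)) of an L-level CC structure with spread S."""
    N = S**L
    return [((k, i), (k + 1, (i + m * S**k) % N))
            for k in range(L) for i in range(N) for m in range(S)]


def has_unique_paths(L, S):
    """True if every base reaches every apex by exactly one path."""
    N = S**L
    succ = defaultdict(list)
    for u, v in cc_arcs(L, S):
        succ[u].append(v)
    for base in range(N):
        counts = {(0, base): 1}
        for _ in range(L):
            nxt = defaultdict(int)
            for v, n in counts.items():
                for w in succ[v]:
                    nxt[w] += n
            counts = nxt
        if len(counts) != N or set(counts.values()) != {1}:
            return False
    return True


print(has_unique_paths(3, 2))   # True
print(has_unique_paths(3, 4))   # True; N = 64, the size quoted for the barrel switch
```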

The CC structure demonstrates that multi-level 
regular banyans can be built without using the re- 
cursive technique of Theorem 1. Also it can be 
shown that the base and apex distance functions of a 
CC banyan differ from those of the recursively de- 
fined SW banyans in that bases or apexes appear to 
be arranged in a circle rather than in nested subsets.
The distance between two bases or apexes is
then determined by their separation on the circle (5).


6. SIMULATION RESULTS 

The number of layers typically required for a 
given banyan to fully partition its bases has not been 
obtained analytically. To obtain an indication of the 
layers required, several rectangular banyan net- 
works were simulated. 

The simulations tested the ability of networks to 
connect randomly selected partitions. First, the
number of subsystems in a partition was selected
as a pseudo-random number from 1 to the number
of bases in the network. Then each resource module
was assigned to one of these subsystems selected
at random. The number of modules in any subsystem
could thus vary and could even be zero in
some cases. Subsystems were then connected one
at a time, placing each in the first available layer
until the entire partition was realized. All subsystems
were then dissolved and the procedure was repeated
for a total of 100 partitions. Details of the simulations
can be found elsewhere (5).
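
The procedure just described can be approximated in a few lines. This is a sketch under several assumptions, not a reconstruction of the authors' simulator: it models a rectangular SW banyan (F = S), numbers bases and apexes so that nested subsets are contiguous blocks, uses the criterion of Theorem 2 as the conflict test, picks the lowest-numbered eligible apex, and skips empty subsystems. All function names are hypothetical, and no attempt is made to reproduce the published figures.

```python
import random


def nested_distance(x, y, radix):
    d = 0
    while x != y:
        x //= radix
        y //= radix
        d += 1
    return d


def random_partition(n_modules, rng):
    """Pick 1..n subsystems, then assign each module to one at random."""
    k = rng.randint(1, n_modules)
    groups = [[] for _ in range(k)]
    for module in range(n_modules):
        groups[rng.randrange(k)].append(module)
    return groups


def fits(L, F, bases, apex, layer):
    """True if (bases, apex) conflicts with nothing already in the layer."""
    for other_bases, other_apex in layer:
        base_d = min(nested_distance(b, o, F)
                     for b in bases for o in other_bases)
        if L >= base_d + nested_distance(apex, other_apex, F):
            return False
    return True


def layers_for(L, F, groups):
    """Number of layers used when each subsystem goes in the first that fits."""
    n_apexes = F**L
    layers = []
    for bases in groups:
        if not bases:
            continue                      # empty subsystem: nothing to connect
        for layer in layers:
            apex = next((a for a in range(n_apexes)
                         if fits(L, F, bases, a, layer)), None)
            if apex is not None:
                layer.append((bases, apex))
                break
        else:
            layers.append([(bases, 0)])   # open a new layer
    return len(layers)


rng = random.Random(1974)                 # arbitrary seed
L, F = 4, 2                               # 16 resource modules
runs = [layers_for(L, F, random_partition(F**L, rng)) for _ in range(100)]
print(sum(runs) / len(runs))              # average layers over 100 partitions
```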


The average number of layers required to fully connect
these partitions was computed for several sizes of
rectangular SW banyans, as graphed in Figure 9. With
a fanout of 2 or 4, the average layers required appears
to grow logarithmically with the number of resource 
modules. With a fanout of 3, this function appears to 
grow more slowly than the logarithm; however, one 
must be cautious about concluding this with only 3 data 
points. Larger networks were not simulated because 
of computer time limitations, but additional simula- 
tions of CC networks and of rectangular SW networks 
with modified setup rules have generally supported the 
observation that the average number of layers re- 
quired grows no more rapidly than a logarithmic func- 
tion of the number of resource modules. 


Figure 9. Simulation Results (horizontal axis: number of resource modules)


It is also apparent that with other factors equal,
networks with larger fanouts tend to require fewer 
layers. For example, with 64 bases, the networks 
in Figure 9 required an average of 2.35 layers with 
F=2 and 1.91 with F=4. A similar network with F=8 
required only 1.8 layers. 

In several respects the results in Figure 9 repre- 
sent worst case conditions more severe than those 
likely to be found in actual systems. First, it was 
assumed that in each partition, every resource 
module was assigned to some subsystem; i.e., no 
idle resources. Furthermore, trivial subsystems 
containing only one module were connected with apexes 
in the usual fashion even though this would likely be 
unnecessary in practical systems. These simulations 
assumed also that knowledge of the base distance 
function could not be used to enhance performance as 
suggested in section 4.1.

The priority rule used for selecting apexes in the 
simulations of Figure 10 is equivalent to selecting the 
leftmost eligible apex when the network is drawn like 
that in Figure 6. Additional simulations have shown 
that some improvement is possible using the criterion 
suggested in section 4.1. 

The average layers required for fully connecting 
all partitions is a useful performance measure be- 
cause it indicates how much the maximum allowable 
data transfer rate available to each subsystem must 
be degraded when all subsystems use a single multi- 
plexed network. In practical systems, however, it 
may not be necessary to connect all desired subsys- 
tems at once, so that the maximum number of layers 
used could be limited to a small number. For example,
in the largest network simulated, an SW with
256 bases and fanout 4, over 87% of the subsystems
were connected in the first layer and over 99% in the
first two, even though an average of 2.39 and a maximum
of 4 layers were required to connect all subsystems.
It was similarly found that the other simulated
networks could connect most subsystems in a single 
layer and all or nearly all with two. 


7. OPTIMUM FANOUT IN RECTANGULAR BANYANS 

In this section we will consider two cost/perfor- 
mance functions for rectangular banyans, and will 
show that for each, there is an optimum fanout which 
is independent of network size. 

It follows from Theorem 3 that there are log_F N
levels in a rectangular banyan with N bases. The
total number of apex and intermediate vertices is then
N log_F N. Each of these vertices has F contacts immediately
below, so the cost of the network in contacts
is given by

    C_1(F,N) = F N log_F N.

Since the worst case propagation delay through the
network is proportional to the number of levels, the
cost delay product is given by:

    C_2(F,N) = F N (log_F N)^2.


This cost/performance measure is especially relevant 
when resources communicate synchronously, allowing 
always for worst case delay. 

Both functions are of the form

    C_P(F,N) = F N (log_F N)^P.

To minimize this with respect to F, we set the partial
with respect to F equal to zero and solve for F.

    ∂C_P(F,N)/∂F = N (log_F N)^P + N F ∂/∂F (log_F N)^P
                 = N (ln N)^P / (ln F)^P - N P (ln N)^P / (ln F)^{P+1}
                 = 0,

which gives ln F = P, that is, F = e^P.

Thus the optimum fanouts are e for function C_1, and
e^2 for C_2. Optimum integer values are found to be 3
for C_1 and 7 or 8 for C_2. Note that when F = N the
network becomes a crossbar structure, implying that
for large N, a rectangular crossbar can be considered
a nonoptimal special case.⁵
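
The integer optima can be checked numerically. The sketch below is an illustration with hypothetical function names; it uses the fact that, for fixed N, C_P(F,N) = F N (log_F N)^P depends on F only through the factor F / (ln F)^P, and it ignores the separate requirement that F be a root of N.

```python
import math


def fanout_factor(F, P):
    """F-dependent part of C_P(F, N) = F N (log_F N)^P, i.e. F / (ln F)^P."""
    return F / math.log(F) ** P


def best_integer_fanout(P, candidates=range(2, 33)):
    return min(candidates, key=lambda F: fanout_factor(F, P))


def cost(F, N, P):
    """C_P(F, N) for a rectangular banyan with N bases and fanout F."""
    return F * N * math.log(N, F) ** P


print(best_integer_fanout(1))   # 3
print(best_integer_fanout(2))   # 7
print(fanout_factor(8, 2) / fanout_factor(7, 2))   # just above 1: F = 8 is nearly as good
```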

The average layers required for fully connecting 
random partitions was not considered in this optimiza- 
tion because its dependency on fanout is not precisely 
known. The simulation results indicate that somewhat 
fewer layers are required when F is large, suggesting 
that optimum fanouts would be somewhat larger if the 
number of required layers were considered. 


⁵ The crossbar of Figure 1 is not rectangular since
s-4 and F=N, but its cost function still grows as N^2
and exceeds both C_1 and C_2 of optimal banyans for
large N.


One's choice of fanout could also be influenced by 
such factors as packaging constraints and the fanout 
capability of devices used. Further, in a regular
banyan, F must be a root of N. 


8. CONCLUSIONS 

Regular banyan partitioning networks have been 
described, whose fanout requirements are constant 
with respect to system size, and whose cost function 
grows as N log N rather than N^2 of the crossbar.
Worst case propagation delay grows as log N. Dis- 
regarding fanout problems, the propagation delay of 
data paths in a crossbar is constant; however, that of 
priority hardware used to resolve simultaneous re- 
quests within subsystems would still grow as log N, 
assuming methods similar to (17).

Simulation of such networks with up to 256 re- 
source modules has indicated that most subsystems 
of randomly selected partitions can be connected with 
only one or two layers, which might prove adequate 
in many applications. 

In applications where the network cannot be thus 
limited, any partition could be fully connected by a 
multiplexed network. In the simulated networks, the 
average layers required to fully connect random 
partitions appears to grow no more rapidly than log 
N, which still allows a cost/performance advantage 
over the crossbar for large N. 

The simulation results presented here indicate 
that the number of layers required can be small 
enough not to offset the cost advantages of banyans in 
large systems. The networks were simulated under 
artificial conditions that were worst case in several 
respects. Many variations of banyan networks are 
possible, only some of which have been presented 
here. It would indeed be interesting to apply a banyan 
network to a specific system where it could be tai- 
lored to requirements. 


This paper has concerned itself with the use of 
banyan networks for partitioning applications. Con- 
sequently, we have not attempted to compare cost 
performance with networks designed for different 
functions, such as permuting or store-and-forward 
message switching. It is felt, however, that the
adaptation of banyan structures for such applications 
warrants further study. 

Theoretical results concerning the behavior and 
structure of banyans can provide insight and suggest 
ways to enhance performance. With increased 
notational complexity, most of the theoretical results 
discussed here for regular banyans, including SW and 
CC structures, can be extended to L-level banyans in 
which fanout and spread may be different for each 
level (5). Since a number of structures proposed 
previously for other applications are special cases of 
banyans or contain them as subgraphs, it is expected 
that banyan theory could also be useful in other areas, 
especially that of permutation networks.


Abstract


Several languages have been developed for or ap- 
plied to the problem of describing digital hardware 
systems. This paper points out some of the problems 
encountered in hardware descriptions, particularly 
where they are distinct from concepts appearing in 
programming languages. 


Introduction 


There are three major goals of a hardware descrip- 
tion language: human comprehension, simulation and 
construction. The requirements imposed by these goals 
are most easily specified in reverse order. The goal 
of system construction requires that the language spe- 
cifically describe the actual hardware needed to build 
the machine. The language need not, however, describe 
the structure of any sub-unit which is available as 
one piece, such as an MSI or LSI integrated circuit.

A description of the terminal behavior of such sub- 
units may be necessary but their actual construction 
is not of interest to the designer. It must be clear 
from the description where hardware is implicitly 
specified in a description, as in the case of multi- 
plexers on register inputs when the register may re- 
ceive information from several sources. 

The goal of simulation requires an accurate be- 
havioral input-output description of each sub-unit in 
the machine. The behavioral description of sub-units 
need not be in any correspondence with the structure
of the sub-units, but it must be in a form which is 
executable by the simulator. A simulation may be re- 
quired to produce more or less detailed results and 
therefore the structural description of the circuit 
might be carried out to different depths before the 
behavioral type of description is employed. In all 
cases, however, one must be able to reduce every ele- 
ment of the description of a system to a behavioral 
description in terms of the language of the simulator. 

The goal of human comprehension is somewhat more 
difficult to define. This goal is first in order of 
importance because the utility of any language depends 
on how easily human designers comprehend and write 
descriptions in the language. As Iverson has pointed 
out in describing APL [1], effective suppression of 
inessential detail is important to human comprehension 
and ease of use. However, the design environment in 
which hardware descriptions are done is quite variable. 
For example, detail which is nonessential to system 
structure if a large scale integration arithmetic and 
logical unit is to be employed becomes essential if 
the unit is to be constructed out of smaller sub-units. 
In such variable environments, the effective suppres- 
sion of detail seems to depend upon flexible mechanisms 
for implicit substructuring such as are afforded by 
extensible languages. In such an extensible language, 
concise syntactical and semantic constructs can be de- 
fined and later used in a simple form in the descrip- 
tion of the system or a set of systems based on the 
same hardware primitives. 

The authors feel that the syntactic structure of 
a language is important to human comprehension in so 
far as it reflects the logical structure of the thing 
described. The sequencing of statements, block struc- 
turing, if-then-else clauses, and iteration clauses 
are syntactic features of programming languages which 
directly reflect execution time features of the compu- 
tation described by a program written in the language. 
One of the important problems in designing high level 
languages for describing digital systems is that of 
isolating significant logical structure features of 
systems and providing syntactic constructs in the de- 
scription language which accurately reflect these fea- 
tures. 


Parallelism 


Several semantic problems arise in describing the 
structure and operation of a digital computer or other 
digital system. The primary problem is that of 
describing parallelism in the operation. Digital systems 
consist of a large number of components connected in 
a complex way and operating sequentially in time. In 
order to successfully describe such a system one must 
be able to group elements that are logically connected 
to one another in the electronic circuitry and to de- 
scribe the operation of this sub-unit consisting of a 
set of connected circuits. One must also be able to 
associate groups of steps which take place sequentially 
in time to describe a time-sequence or sub-sequence of 
operations within the computer. The necessity for 
grouping elements both in space and in time gives rise 
to the primary linguistic problems of describing a 
digital computer. We thus wish to consider what prop- 
erties a descriptive mechanism or language must have 
in order to successfully describe digital computers. 

Iverson, Falkoff and Sussenguth have used the APL 
language to describe the hardware of the IBM 360 
series of computers. [2] The primary advantage of APL 
in describing a digital computer stems from the fact 
that it has a large number of primitives which specify 
inherently parallel operations. The primitives involved 
are primarily operations on bit vectors. The 
APL primitives have the ability to transfer the values 
of bit vectors from one variable or register to an- 
other, obtain values from subfields of large bit vec- 
tors, and apply certain transformations to the bit vec- 
tors either one bit at a time or over all bits of the 
vector. These concepts are quite natural to a parallel 
computer. They are somewhat less applicable to serial 
computers. The structure of the APL primitives is 
parallel in nature, but the overall structure of the 
language is sequential. Programs in APL consist of a 
sequence of consecutively numbered and executed steps 
as in most other programming languages. It is to be 
expected then that the points at which APL becomes 
strained in describing computer hardware are just 
those points at which large scale parallelism becomes a 
factor in the design. Sub-units which have internal 
sequential structure yet operate in parallel in a ma- 
chine are not handled smoothly by APL. 

A higher degree of flexibility in describing par- 
allel and sequential operations is afforded by the ISP 
language developed by Bell and Newell.[3] By using 
the semi-colon and the semi-colon followed by "next" 
properly in an ISP description and by including paren- 
theses in appropriate places a complex structure built 
of sequential and parallel sub-units can be constructed. 
The technique is quite similar to describing a series- 
parallel electrical network. The types of structures 
which cannot be described with the ISP type mechanism 
of a parallel separator and a sequential separator are 
in fact just those cross-linked type structures which 
correspond to bridge-type connections in an electrical 
circuit. The most general case of course is a group 
of nodes representing elementary actions with a partial 
ordering imposed on the nodes. It seems that none of 
the familiar syntactic mechanisms form structures simi- 
lar to general partial orders. (On the other hand much 
execution sequence information is clearly mirrored in 
programming language syntax.) The cross-linked struc- 
tures do not often appear in computer design, and a 
language which offers only the series-parallel mechan- 
ism of description will be quite adequate for a large 
number of applications. 

The syntactic structure of this mechanism can be 
summarized in BNF as follows: 


<system description> ::= <step> | 
        <system description> <sequential separator> <step> 

<step> ::= <action> | 
        <step> <parallel separator> <action> 

<action> ::= <elementary action> | ( <system description> ) 


An example using : as the parallel separator, > as the 
sequential separator and EAi as a name for the ith 
elementary action is: 


EA1 > EA2 > EA3 : EA4 : EA5 > EA6 : (EA7 > EA8) > EA9 


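The partial ordering imposed by the above description 
can be represented by the covering relation diagram in 
figure 1. 


[Covering relation diagram: EA1 precedes EA2; EA2 precedes EA3, EA4 and EA5; each of these precedes EA6 and EA7; EA7 precedes EA8; EA6 and EA8 precede EA9.] 


Figure 1 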
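
The series-parallel grammar above is small enough to parse directly. The following is a minimal sketch in Python, not part of the original paper: the function name `covering_relation` and the tokenizer are assumptions made for illustration. It reads a description that uses `:` and `>` as separators and returns the covering pairs of the partial order it imposes.

```python
import re


def covering_relation(text):
    """Return the covering pairs (before, after) of a series-parallel description."""
    tokens = re.findall(r"[A-Za-z0-9]+|[>:()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of description")
        pos += 1
        return tokens[pos - 1]

    # Each parser returns (first, last, edges): the minimal actions, the
    # maximal actions and the covering pairs of the sub-description.
    def description():
        first, last, edges = step()
        while peek() == ">":                      # sequential separator
            take()
            f2, l2, e2 = step()
            edges = edges | e2 | {(a, b) for a in last for b in f2}
            last = l2
        return first, last, edges

    def step():
        first, last, edges = action()
        while peek() == ":":                      # parallel separator
            take()
            f2, l2, e2 = action()
            first, last, edges = first | f2, last | l2, edges | e2
        return first, last, edges

    def action():
        tok = take()
        if tok == "(":
            result = description()
            if take() != ")":
                raise SyntaxError("expected ')'")
            return result
        if not tok[0].isalnum():
            raise SyntaxError("expected an elementary action, got " + tok)
        return {tok}, {tok}, set()

    _, _, edges = description()
    if pos != len(tokens):
        raise SyntaxError("unexpected token " + tokens[pos])
    return edges


edges = covering_relation("EA1 > EA2 > EA3 : EA4 : EA5 > EA6 : (EA7 > EA8) > EA9")
```

Because a sequential separator links every maximal action on its left to every minimal action on its right, the pairs produced are exactly the edges one would draw in a covering relation diagram.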
The situation shown in the partial order diagrammed 
in figure 2, where EA2 and EA3 must be complete before 
EA4 starts whereas starting EA5 requires only the com- 
pletion of EA2, cannot be represented by the above 
linguistic mechanism. 


[Partial order diagram: EA4 follows both EA2 and EA3, while EA5 follows only EA2.] 


Figure 2 


The closest approximation imposes the restriction that 
EA3 precede EA5 which is not required by the original 
structure. 
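
The extra restriction can be checked mechanically. The sketch below (Python, illustrative only; it uses just the four actions the text names and a hypothetical helper `closure`) compares the required order with the series-parallel approximation (EA2 : EA3) > (EA4 : EA5).

```python
def closure(pairs):
    """Transitive closure of a set of (before, after) pairs."""
    order = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(order):
            for c, d in list(order):
                if b == c and (a, d) not in order:
                    order.add((a, d))
                    changed = True
    return order


# Required order from the text: EA4 waits for EA2 and EA3, EA5 only for EA2.
required = closure({("EA2", "EA4"), ("EA3", "EA4"), ("EA2", "EA5")})

# Series-parallel approximation (EA2 : EA3) > (EA4 : EA5).
approximation = closure(
    {(a, b) for a in ("EA2", "EA3") for b in ("EA4", "EA5")}
)

# Constraints present in the approximation but not in the required order.
extra = approximation - required
```

Under these assumptions `extra` contains only the pair (EA3, EA5), which is the restriction the text describes.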

Notice that there is no timing information present 
in the above descriptions. Only sequence information is 
given as a set of requirements on which actions precede 
others. Timing is more explicit in a language such as 
Schorr's Register Transfer Language. [4] Using the 
conditional execution feature of Schorr's language the 
sequence information inherent in EA1 > ((EA2 > EA4) : 
EA3) > EA5 might be expressed by: 


|t1| EA1 ; 1 → t2 ; 1 → t3 
|t2| EA2 ; 1 → t4 
|t3| EA3 ; 1 → t5 
|t4| EA4 ; 1 → t6 
|t5 ∧ t6| EA5 


The timing is more explicit in the RTL but extra detail 
(the arbitrary ordering of the lines and names for the 
times) tends to obscure the sequence information. It 
seems that the specification of sequence of actions in 
a digital system may be a higher level concept than the 
specification of timing. Note that when a high level 
system description specifies only a partial order on 
actions the tasks of simulation and construction are 
complicated by the need for implicit rules for resolv- 
ing ambiguities in timing. 


Function Types 


Another type of descriptive dichotomy which exists 
in computer hardware description is the need to describe 
sub-units in several different ways. Linguistically, 
a sub-unit may be described in a single statement or by 
a procedure definition, and consists of a set of input 
variables and a set of output variables together with 
some well-defined set of rules for computing the 
values of the output variables from the values of the 
input variables. Two distinct types of functions arise 
in the description of hardware: combinational functions 
and sequential functions. The combinational functions 
may be thought of as statically defined structures in 
the sense that as long as the inputs are constant the 
output is constant (except for propagation delay 
effects), and inputs and outputs are quite distinct. 
The sequential function, on the other hand, has a set 
of parameters associated with it which can be thought 
of as registers. The sequential function is invoked 
at a particular point in time; it uses the values in 
the input registers to perform some computation in 
some finite number of steps and produces results in the 
output registers, some of which may be the same as the 
input registers. 

There are also two distinct types of function us- 
age. One sort of use involves assembling a separate 
set of hardware according to the specifications set 
forth in the function definition. In this case, the 
definition may be thought of as generic in nature, de- 
scribing many devices of the same structure. The 
second function usage involves using a single hardware 
unit at several places within a system description with 
the separate uses of the unit occurring at different 
times and perhaps involving multiplexed inputs and/or 
outputs. An example of a generic use of a function 
definition would be to describe the structure of sever- 
al similar 16 bit counters in terms of their components. 
Multiple usage of a specific function definition would 
occur if the same adder were used for arithmetic and 
effective address computation. 

One possible way to dissolve this descriptive di- 
chotomy is to relegate generic function descriptions to 
a strictly linguistic mechanism such as text substitu- 
tion macros. If this is done then the appearance of a 
function name with appropriate parameters for input and 
output can be thought of as simply a name for the com- 
putation which takes place within the function defini- 
tion. Specific functions, on the other hand, can be 
represented as procedures or closed subroutines which 
are invoked during the running of the system. Each 
computation begins at the point in time at which the 
procedure is invoked, and results are available after 
the characteristic delay time associated with the se- 
quential device or the system of combinational logic 
and multiplexers. 

It is convenient to allow a sequential function to 
have multiple entry points to clarify the correspondence 
between structural and sequential type descriptions. 
This facility permits the description of sub-units 
which perform several functions. An example of such a 
sub-unit is a shift register which may be shifted left 
by clocking one input, shifted right by clocking an- 
other input, or loaded in parallel by clocking a third 
input. Each of these three clock inputs can be repre- 
sented by a distinct entry point. 
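
As an illustration of multiple entry points, the shift register example can be sketched in Python with one method per clock input. The class name, the method names and the `serial_in` parameter are assumptions made for this sketch; the paper does not prescribe a notation.

```python
class ShiftRegister:
    """A sequential sub-unit with three entry points, one per clock input."""

    def __init__(self, width):
        self.width = width
        self.value = 0

    def shift_left(self, serial_in=0):
        # Entry point 1: clocking the shift-left input.
        mask = (1 << self.width) - 1
        self.value = ((self.value << 1) | (serial_in & 1)) & mask
        return self.value

    def shift_right(self, serial_in=0):
        # Entry point 2: clocking the shift-right input.
        self.value = (self.value >> 1) | ((serial_in & 1) << (self.width - 1))
        return self.value

    def load(self, word):
        # Entry point 3: clocking the parallel-load input.
        self.value = word & ((1 << self.width) - 1)
        return self.value


register = ShiftRegister(4)
register.load(0b1001)
```

All three entry points share the single stored value, which is what distinguishes this from three separate functions.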


Control 


Yet another dichotomy exists in hardware descrip- 
tion, namely the dichotomy between data and control. 
It is important for the suppression of detail in a high- 
level description to let the control signals be implic- 
itly described by the order in which statements are to 
be executed, but there must be mechanisms for inter- 
action between control and data. After an instruction 
has been decoded, for example, the signals which repre- 
sent the instruction must cause a transfer of control 
to the portion of the description which executes that 
instruction. This is conventionally handled by condi- 
tional statements of one kind or another. It is also 
useful to allow control signals to be treated as data; 
this can be done by introducing the object "*" which 
is logical 1 when the statement containing the "*" is 
being executed and logical 0 otherwise. This concept 
is similar to that of a program counter in a program- 
ming language, but due to parallelism there may be more 
than one of them active at a given time. This facility 
can be used within a language to deal with the problem 
of implicit multiplexing, as follows. The language 
associates with each register R in the description two 
expressions: INPUT (R) and CLOCK (R). These expres- 
sions are obtained by initially setting each of them 
to 0 and then examining every statement in the 
description; whenever a register transfer 

    R ← S 

is encountered, INPUT(R) is replaced by 

    INPUT(R) + S ∧ * 

and CLOCK(R) is replaced by 

    CLOCK(R) + * 

Here + is an operation such that a + b equals a ∨ b if 
a ∧ b = 0 and is undefined otherwise. A simple example 
of this mechanism is shown in figure 3. 


Translation of Register Transfer Statements to Networks 


Translation of R ← S: 
    INPUT(R) = ... + S ∧ * + ... 
    CLOCK(R) = ... + * + ... 
where a + b = a ∨ b if a ∧ b = 0 and is undefined if a ∧ b = 1 


    TIME    ACTION 
    Ti      R ← S 
    Tj      R ← Q 

    INPUT(R) = Ti ∧ S ∨ Tj ∧ Q 
    CLOCK(R) = Ti ∨ Tj 
    Validity condition: Ti ∧ Tj = 0 


Figure 3 
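
The accumulation of INPUT(R) and CLOCK(R) can be sketched as follows. This is an illustrative Python model rather than the authors' implementation: the triple format `(step, register, source)` and the function names `translate` and `evaluate` are assumptions, and single-bit signal values are used for simplicity.

```python
def translate(transfers):
    """Collect INPUT(R) and CLOCK(R) terms from (step, register, source) triples."""
    input_terms, clock_terms = {}, {}
    for step, register, source in transfers:
        input_terms.setdefault(register, []).append((step, source))
        clock_terms.setdefault(register, []).append(step)
    return input_terms, clock_terms


def evaluate(register, input_terms, clock_terms, active, signals):
    """Return (clock, data) for one register, given the set of active steps."""
    firing = [step for step in clock_terms[register] if step in active]
    if len(firing) > 1:
        # The "+" operation is undefined when two of its operands are 1.
        raise ValueError(f"steps {firing} drive {register} simultaneously")
    clock = int(bool(firing))
    data = 0
    for step, source in input_terms[register]:
        if step in active:
            data |= signals[source]     # term (step AND source)
    return clock, data


# The example of figure 3: R <- S at time Ti, R <- Q at time Tj.
input_terms, clock_terms = translate([("Ti", "R", "S"), ("Tj", "R", "Q")])
signals = {"S": 1, "Q": 0}
```

The `ValueError` plays the role of the validity condition: the description is only meaningful if no two steps that load the same register are active together.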


Levels of Description 


A hardware description language should be appli- 
cable to various levels of description. In particular, 
it is extremely useful at any level to have the 
ability to describe the input/output behavior of some 
"black box" circuit without describing its internal 
construction. In this way, portions of the circuit 
may have their descriptions put off until a lower level 
of description is reached. Such a description of the 
input-output behavior of a "black box" should be put 
in the clearest and most convenient form possible. In 
some cases, this may mean that the input-output de- 
scription is a sequential description when in fact the 
unit is a combinational unit, or it may mean that the 
description may be combinational while the unit actu- 
ally operates sequentially. In general, then, it is 
not desirable to require that the structure of the box 
match that of the description, since this internal 
structure is precisely what we are trying to suppress. 
There is probably no need for a separate language for 
the description of the terminal behavior of sub-units. 
The hardware description language itself should be 
flexible enough to provide a clear and concise description 
of any possible unit. There should, however, be 
some distinction made between the two kinds of appli- 
cation of the language to clarify whether it is being 
used to describe the actual structure of a sub-unit of 
a machine or merely to specify the input-output be- 
havior of the sub-unit for the purpose of describing 
the activity of the rest of the machine. 

There must also be provided methods for describing 
timing and sequence of sub-unit interfaces when this 
information is not reflected by the behavioral descrip- 
tion. 




Use of Names 


We also wish to consider the role of names in a 
hardware description language. There are three 
classes of objects which may be named in the descrip- 
tion of a machine. These three classes have somewhat 
different properties. One use of names is in the de- 
scription of values stored in registers of the machine. 
These names play roles quite similar to the roles 
played by variable names in programming languages. The 
values associated with the names can be non-destruc- 
tively read and used, and are changed only as a result 
of an action which stores a new value into the register. 
Another possible use of names in describing computer 
hardware is to identify Boolean signals or vectors of 
Boolean signals which appear within the machine. For 
example, consider a bus within a machine which is used 
to transfer values among the registers of the machine. 
The value on the bus may change either because of a 
change in the contents of the register that is current- 
ly multiplexed onto the bus or because of a change in 
the contents of one or more of the flipflops that de- 
termine which register is multiplexed onto the bus. 

The value of the bus is therefore a combinational func- 
tion of the values of several registers. It is useful 
to name the bus in order to specify transfers of 
values between the bus and registers; such a usage of a 
name illustrates the second role for names. Finally, 
names may be used to designate system modules (either 
generic or specific) the behavior of which involves a 
mixture of both registers and signals. 

The usages for names discussed above can be dis- 
tinguished by declaration. If all registers in the 
machine must be declared and all combinational func- 
tions are declared then the digital system is well de- 
fined for the purposes, say, of simulation. The simu- 
lator can maintain internal variables which keep track 
of the current values of each register, and whenever 
the simulator encounters the use of a combinational 
function output as a value it can examine the necessary 
combinational function definitions to determine this 
output as a function of the values of the internal vari- 
ables. Of course it may be necessary for the simulator 
to trace back through several levels of combinational 
function definitions in order to find registers whose 
values completely determine the final output. A 
summary of three possible variants of the assignment 
statement in a programming language based on the dif- 
ferent declarations and use of names mentioned above 
is given in figure 4. 
DECLARATION AND USE OF NAMES 


1. Signal X 
   X @ FCN(Y,Z)      X is wired to output of FCN. 

2. Signal X 
   X := FCN(Y,Z)     The value of X is made to match 
                     the output of FCN for the dura- 
                     tion of this step. 

3. Register X 
   X ← FCN(Y,Z)      The value of FCN is strobed 
                     into register X in this step. 


Implied Circuitry 

[Circuit diagrams for the three cases, labelled with X, FCN, old X input and old X clock] 

* is a Boolean which is true only 
for the duration of the step 
which describes the associated 
hardware. 


Figure 4 
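
The tracing procedure described for the simulator, in which a combinational output is evaluated by working back through function definitions until registers are reached, can be sketched in Python. The names `signal_value`, `BUS`, `SEL` and `BUS_PLUS_ONE` are hypothetical and chosen only to mirror the bus example in the text.

```python
def signal_value(name, registers, combinational):
    """Evaluate a named signal by tracing back to declared registers."""
    if name in registers:
        return registers[name]              # stored value, read non-destructively
    function, inputs = combinational[name]  # declared combinational function
    arguments = [signal_value(i, registers, combinational) for i in inputs]
    return function(*arguments)


# Declared registers: two data registers and a select flip-flop.
registers = {"A": 5, "B": 9, "SEL": 0}

# Declared combinational functions: a bus multiplexed from A and B, and a
# second signal defined in terms of the bus (two levels of tracing).
combinational = {
    "BUS": (lambda sel, a, b: b if sel else a, ("SEL", "A", "B")),
    "BUS_PLUS_ONE": (lambda bus: bus + 1, ("BUS",)),
}

value = signal_value("BUS_PLUS_ONE", registers, combinational)
```

Changing either the selected register or the select flip-flop changes the value of the bus, which is the behavior the text attributes to a named signal.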
ABSTRACT 


The VDL system for the description of programming 
languages which was originally used for the definition 
of PL/I is extended to the description of processors. 
This paper shows the relationship between the language 
of definition and the abstract machine over which the 
semantics of the language are specified. It is demon- 
strated that the level of description can be chosen to 
suit the various needs of the computing community, each 
level being well nested within its outer level, whilst 
using only one language of definition. 


From the point of view of processor design, indications 
are given of the means by which a description can be 
transformed into an implementable system of data paths, 
registers and drivers. 


INTRODUCTION 


The techniques of formal definition, as applied to the 
description of the PL/I programming language by Lucas 
and Walk (Lu1), have since been applied to other systems 
by the author (Le1, Le2). On the basis that the 
definition of a programming language consists of a 
system of definitions of algorithms, the method of 
definition is applicable to not only languages, but 
also the description and definition of algorithms, in 
particular, to processors. 


The formal definitional system described by Lucas and 
Walk consists of a synthetic language defined to 
operate over a set of data objects which can be 
described in terms of non-cyclic trees. In the 
definition of a programming language such as PL/I, it 
is necessary to consider not only the definition of 
the syntax of the source language but also a description of 
the technique for converting that language (or an 
analyzed version of it) into a form suitable for use 
in the description of its semantics. 


In the case of describing a processor, we shall pay 
little attention to the syntactic form of the 
associated machine language and no attention at all to 
the external form of that language. In fact, we shall 
assume that any program to be executed exists only 
within the object which represents, in the abstract 
machine which models the prototype, the storage part 
of the prototype. 


The VDL definitional schema is so organized, as will be 
described later, that the techniques of top-down 
programming naturally evolve and thus the level of 
description can be matched with the needs for under- 
standing of the intended recipient of the description. 
Further, by a judicious choice of identifiers in the 
description, the understanding of the recipients can 
be enhanced to the point where the description is 
highly readable. Using the macro-expansion techniques 
of description which are analogous with the commonly 
used techniques of describing machine instructions in 
terms of lower level actions associated with event 
times, a single description can contain a continuum of 
definition levels. At the outer level, the description 
can correspond very closely to the style of description 
which is associated with a machine reference manual. 

At succeeding levels of definition, more detailed 
descriptions can be offered which reveal further 
details of implementation. For example, the outer 
level of definition may reveal that (say) an ADD 
instruction is executed by adding the contents of the 
accumulator and contents of the referenced cell, and 
then leaving the result in the accumulator. For most 
purposes of programming this definition will be 
sufficient; however, the further description of these 
components can show the utilization of the individual 
registers and the data paths between the registers. 
This level may not necessarily reveal the actions of 
the drivers for the registers, this being left to the 
next level, until eventually the logic level of the 
gates is reached. This ability of a single description 
language to provide these many levels of definition 
makes it a prime candidate for general use as a 
processor descriptor. That is, rather than having 
several languages for the description of each level of 
a processor action, the Vienna Definition Language, by 
its design, provides for all the definitional needs of 
the computer architect. 


The major emphasis of the usage of VDL has been on the 
linguistic aspects of the definitional schema, little 
attention having been drawn to the abstract machine, 
the actions of which the language describes. The 
properties of this abstract machine are only now being 
investigated more thoroughly and it can be expected 
that these investigations will provide a firm basis 
for the development of the properties of the machine 
described. 


Within the definitional scheme itself there exist two 
levels of abstraction: the level of description used 
previously in the semantics of programming languages, 
and an inner machine, being a finite state machine 
over which the "outer level" descriptors are defined. 


THE INNER MACHINE 


A Definition Machine is a 5-tuple {Σ, Θ, P, μ, τ} 
where Σ = S ∪ {I}, 

    S is a finite non-empty set of closed one-to-one 
      mapping functions (called selectors or selector 
      functions) over Θ, 
    I is the identity selector or function, 
    Θ is a finite non-empty set of objects, 
      where Θ = CO ∪ EO, and 
      CO is a finite non-empty set of composite 
         objects, 
      EO is a finite non-empty set of elementary 
         objects; 
    P is a finite (possibly empty) set of predicates, 
    μ is the mutation operator, and 
    τ is the search function. 



Objects in Θ are defined formally to be a finite non- 
empty set of unique pairs (<s:A>) which specify the 
range (A ∈ Θ) of each selector function (s) in the 
set S over the domain of the object (B). 
Notationally, an object identified as B is represented 
as 


    B = {<s_1:A_1>, <s_2:A_2>, ..., <s_n:A_n>} 


where {s_1, s_2, ..., s_n} = S and (∀i)(A_i ∈ Θ) and B ∈ Θ. 


For simplicity, all pairs whose second component is 
the null object are normally omitted from this set. 
The set of unique pairs which specify the range of 
each selector in S over the object is called the 
characteristic set of the object. The application of a 
selector function to an object is symbolized by s(B). 
If B = {..., <s_i:A_i>, ...}, then by the definition above, 


    s_i(B) yields A_i. 


For the purposes of description, these characteristic 
sets of objects have been likened to non-cyclic trees, 
and thus the common representation of an object is as a 
tree shown in Figure 1. 


FIGURE 1 
A TYPICAL COMPOSITE OBJECT 


[Tree diagram of object B] 


Since the objects selected by the selector functions 
from an object are themselves (by definition) objects, 
then the repeated application of selector functions is 
equivalent to a walk through the tree representation 
from the root to the root of some subtree. This 
repeated application of selectors leads to the usage 
of composite selectors. 


A composite selector K is the representation of the 
successive application of selector functions to an 
object. 


If K = s_1 ∘ s_2 ∘ ... ∘ s_n, then 


    s_1 ∘ s_2 ∘ ... ∘ s_n (X) 
        =Df s_1(s_2(...(s_n(X))...)) 


where (∀i)(s_i ∈ S) and K ∈ Σ⁺. 


As a matter of nomenclature, the selector function s 
(∈ S) is known as a simple selector. 


The object selected from a composite object by the 
composite selector K is known as the K-component. 
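
Objects and selectors map naturally onto nested dictionaries. The following Python sketch is an illustration under stated assumptions, not the VDL notation itself: composite objects are dictionaries keyed by selector name, elementary objects are plain values, `None` stands for the null object, and the innermost selector of a composite selector is taken to be the one written last.

```python
NULL = None  # stands for the null object


def select(obj, selector):
    """Apply a simple selector; omitted pairs yield the null object."""
    if isinstance(obj, dict):
        return obj.get(selector, NULL)
    return NULL  # every selector applied to an elementary object is null


def select_composite(obj, selectors):
    """Apply s1 o s2 o ... o sn, that is s1(s2(...(sn(obj))...))."""
    for selector in reversed(selectors):   # innermost selector first
        obj = select(obj, selector)
    return obj


# A small composite object: a processor state with two sub-objects.
state = {
    "cpu": {"acc": 7, "pc": 100},
    "store": {"cell0": 3},
}

acc_component = select_composite(state, ["acc", "cpu"])
```

Each application of `select` descends one branch of the tree representation, so a composite selector corresponds to a path from the root to some subtree.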


An elementary object within the machine (eo ∈ EO) is 
characterized by a set in which the range of every 
selector is the null object (Ω). Elementary objects 
may be regarded as "atomic" or "indivisible" objects. 
The precise set of elementary objects associated with 
a definition must be defined in advance and may be 
dependent on the level of definition. For example, in 
the case of a user level of definition it may be 
sufficient to consider the set of elementary objects to 
be words, whereas for the gate level of definition the 
set of objects may simply be the binary digits. This 
definition of an elementary object then provides a 
simple definition of a composite object: 


A composite object is an object in which the range of 
at least one selector function s ∈ S is not the null 
object. 


    (∃s)(s(B) ≠ Ω), s ∈ S, B ∈ CO ⊂ Θ 


The primary function which operates over objects is the 
mutation function μ which is a closed function over the 
set of objects Θ. Notationally, the function and its 
arguments are represented by 

    μ(A; <s:B>) 


The range of the function is 


    (A − {<s:s(A)>}) ∪ {<s:B>} 


That is, the mutation function creates a copy of 
object A (the subject argument) in which the s- 
component is replaced by the object B. This elemental 
function has the property that three basic operations 
can be simulated by its usage: replacement, deletion 
and construction. As described above, the basic 
operation of replacement is obvious; object B replaces 
the previous s-component in the copy of the object A. 
By specifying that the replacement object is the null 
object (Ω), the process of deletion is simulated. 
Similarly, if the original subject argument (A) had 
been the null object, then any mutation of that object 
with non-null objects constructs a new object. 
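
The three uses of the mutation function can be shown with a short Python sketch. As before, dictionaries stand in for composite objects and `None` for the null object; the function name `mutate` is an assumption made for illustration.

```python
NULL = None  # stands for the null object


def mutate(subject, selector, replacement):
    """Return a copy of subject whose selector-component is replacement."""
    result = dict(subject) if isinstance(subject, dict) else {}
    result.pop(selector, None)           # remove the pair <s:s(A)>
    if replacement is not NULL:
        result[selector] = replacement   # add the pair <s:B>
    return result


registers = {"acc": 1, "pc": 2}

replaced = mutate(registers, "acc", 5)      # replacement
deleted = mutate(registers, "pc", NULL)     # deletion
constructed = mutate(NULL, "acc", 3)        # construction from the null object
```

The subject is never altered in place, which matches the description of the function as creating a copy. In this sketch an empty dictionary plays the role of an object with no non-null components.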


The set of predicates in the inner machine provides a 
basis for the discriminating properties of the 
definitional schema. In the definition of programming 
languages, predicates are used to define the valid 
objects which can compose the abstract text (cf. 
abstract syntax (Mc1)) over which the semantics of the 
language are to be defined. In a processor, these 
predicates describe the internal structure of the 
machine being modeled and certain properties which it 
is necessary to have the capability of recognizing, 
such as that the contents of the accumulator are zero. 
Combined with expressions, predicates form conditional 
expressions of the form 


    p_1 → e_1, p_2 → e_2, ..., p_n → e_n 


which can be defined by the logical expression 


    p_i & (∀j < i)(¬p_j) ⊃ e_i 


These expressions, in the general case, result in an 
undefined value if none of the predicates are true. 
This is advantageous when the subject of the 
description is a programming language and there can 
exist some "undefined" situations, but in the case of 
a processor, these conditions should be closed properly. 
Considering the levels of definition discussed before, 
conditional expressions correspond closely to the gate 
level of description. The search function (τ) did not 
originally exist in the Lucas and Walk (LU1) descrip- 
tions and definitions, but has been added by the 
author (LE1) to provide more generality. The Lucas and 
Walk unique selector function (ι) is simply a special 
case of the search function. Further, the search 
function closely resembles the associative memory 
polling operation and provides a sound basis for the 
simulation of set operations in language descriptions. 


The search function τ selects from Θ a set of objects, 
each member of which conforms to the specified 
predicate is-pred. 


    (τx)(is-pred(x)) = {x | x ∈ Θ & is-pred(x) = T} 


The expression (τx)(is-pred(x)) is read as "the set 
of those objects (x) chosen from Θ such that the 
predicate is-pred is satisfied." 
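
A minimal Python sketch of the search function follows, together with a unique-selection helper treated as its special case. The names `search`, `unique` and `is_zero_acc` are hypothetical, and a list is returned rather than a set only because Python dictionaries cannot be set members.

```python
def search(objects, is_pred):
    """Return those objects for which the predicate is true."""
    return [x for x in objects if is_pred(x) is True]


def unique(objects, is_pred):
    """Special case: the single object satisfying the predicate, if any."""
    matches = search(objects, is_pred)
    return matches[0] if len(matches) == 1 else None


def is_zero_acc(x):
    """Example predicate: the object has an accumulator holding zero."""
    return isinstance(x, dict) and x.get("acc") == 0


objects = [{"acc": 0, "pc": 4}, {"acc": 3}, 7, {"acc": 0}]
zero_accumulators = search(objects, is_zero_acc)
```

Polling every object against a predicate in this way is the software analogue of the associative memory operation the text mentions.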


THE OUTER MACHINE 


Using the properties defined in the preceding section, 
we may now devise a definitional model, which will be 
the basis for describing processors. This finite state 
machine contains a set of states which contain infor- 
mation on the data being manipulated and the instruc- 
tions (or programs) which define the transformations 
to be executed over the data, and a function (the 
State Transition Function) which interprets and executes 
the instructions in the current state of the machine. 


† Using a standard set notation. 


In attempting to define the properties of a processor, 
the state of the machine is defined to contain, as one 
of its components the complete set of registers and 
storage devices of the processor being modeled. Since 
the definition is itself a program, then the instruc- 
tions which reside in the storage part of the processor 
being modeled act as data elements. In the succeeding 
description here, we shall reserve the term instructions 
to refer to the instructions contained in the 
definition machine. 


Within the state of the definition machine there exists 
a special component which contains the set of instruc- 
tions which are awaiting execution, and which by their 
execution will represent the execution of the commands 
in the processor being modeled. This component is 
known as the control stack and can easily be repre- 
sented by a regular VDL object. However, for the 
purposes of description we can regard the control stack 
to be a tree in which the definitional instructions are 
contained as the nodes of the tree (c.f., the VDL 
object represented as a tree, in which objects exist 
only as the leaves of the branches). 


According to Lucas and Walk (LU1), the order of execution of the 
definitional instructions is defined to be restricted 
to any one of the instructions which exists at a leaf 
of the control stack. Since the execution of instruc- 
tions (see later) includes their removal from the 
control stack, this provides a multi-stacking facility 
whereby instructions can be inhibited from execution 
until all other instructions on their branch (in their 
stack) have been executed. Whilst Lucas and Walk 
insisted that any one of the candidate instructions can 
be executed during a state transition cycle, this 
concept is extended here so as to provide for the 
asynchronous execution of all instructions which are 
existing at the leaves of the control tree. This 
process adequately simulates the asynchronous opera- 
tions within a processor, but solves none of the 
problems of race conditions which are thereby possible. 
However, since the definition of instructions requires 
explicit reference to any data assignments and there 
exist no side effects within VDL, there exists a clear 
potentiality for proving that race conditions either 
exist or are non-existent. 


The initial state of the definition machine is one of 
the elements of the definition of each processor. This 
may correspond directly to the conditions which are 
existing at the time that the manual actions of 
depositing an address into the program counter and 
depressing the RUN key are performed. A final state (a 
halting state) of the definition machine is the state 
in which the contents of the control stack is null; 
that is, there are no further definitional instructions 
to be executed. Other final states may include cases 
where some error condition has arisen and the execution 
of the instructions existing in the stack is undefined. 


Definitional instructions can be executed (depending on 
conditions existing within the state of the machine) 
either as macro-expansion instructions or as state- 
modifying instructions. In the former case, the 
execution of the instruction has the effect of 
replacing itself by a new instruction subtree thereby 
simulating either the passage from one level of 
definition to the next or the sequencing of operations. 
In the case of state-modifying execution, the effect is 
to mutate the state of the machine (other than the 
control stack) thereby simulating operations over the 
registers in the prototype, and then to remove that 
instruction from the control stack. 
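
The two styles of execution can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the control stack is flattened to a single branch (a list whose last element is the next instruction to execute), the state is a dictionary keyed by selector names, and the instruction load-address is hypothetical and does not appear in this paper.

```python
def pc_to_mar(state):
    # state-modifying group: selector:value pairs computed from the state
    return "modify", {"s-mar": state["s-pc"]}

def inc_pc(state):
    return "modify", {"s-pc": state["s-pc"] + 1}

def load_address(state):
    # macro-expansion group (hypothetical): replaced by two instructions;
    # the last one listed is deepest in the stack and executes first
    return "expand", ["inc-pc", "pc-to-mar"]

DEFINITIONS = {
    "pc-to-mar": pc_to_mar,
    "inc-pc": inc_pc,
    "load-address": load_address,
}

def run(stack, state):
    """Execute until the control stack is null (the halting state)."""
    trace = []
    while stack:
        name = stack.pop()
        style, body = DEFINITIONS[name](state)
        trace.append((name, style))
        if style == "expand":
            stack.extend(body)            # the instruction replaces itself
        else:
            state = {**state, **body}     # new state: a modified copy of the old
    return state, trace

final_state, trace = run(["load-address"], {"s-pc": 5, "s-mar": 0})
```

In this sketch the macro-expansion leaves the registers untouched and only reshapes the stack, while each state-modifying instruction changes the registers and then disappears from the stack.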


Whilst there is only one style of execution that an
instruction can be subject to at the time of its execution,
the definition of instructions can specify varying
styles depending on the conditions existing at the 
instant of execution of the instruction. Thus a 
definitional instruction may have several definitions 
itself, only one being applicable at any time. These 
individual definitions are termed "groups." 


The means by which definition groups are chosen from 
within the general instruction definition set is a 
conditional expression, the right hand sides of which 
are the definition groups. That is, the general form 
of an instruction definition is 


    inst(q_1,...,q_n) =

        p_1 → group_1
          ...
        p_m → group_m


where q_1,...,q_n are parameters which are replaced by the
values of the arguments specified in the instruction
at the time that the instruction is placed into
the control stack, and p_1,...,p_m are predicate expressions
which are functions of the set of parameters q_i,
system defined predicates and the state of the machine.
It will be shown later that in the case of describing
processors at the register level, the set of parameters
(and consequently the corresponding set of arguments)
is unnecessary, since the need for a parameter indicates the
need for a register in the prototype.


Where the group is to be a macro-expansion definition, 
the notation is to show not only the set of instruc- 
tions which are to replace the instruction being 
executed in the control stack, but also the structural 
relations between those instructions. The notation 
contains two basic rules for demonstrating the nodal 
position of instructions within the tree: 

   i) indentation indicates a lower level of tree
      placement (lower in the sense of movement
      between the root at the top and leaves at the
      bottom) than instructions not as deeply indented.

  ii) punctuation indicates either a continuation of a
      level by the use of a comma (,) or completion of
      a level by the use of a semicolon (;) except
      where the instruction is the last in the group
      when no punctuation is needed.

It is important to note that since the order of execution
of instructions is from the leaves of the tree
toward the root, the instruction(s) at the bottom
of a group representation are the earliest candidates
for execution. Normal sequential execution of a group
of instructions is represented by a diagonal sequence
of instructions separated by semicolons:


    inst-1;
        inst-2;
            inst-3;
                inst-4


This set of instructions would be executed in the order 


    inst-4  inst-3  inst-2  inst-1

A single instruction cannot be replaced by a set of 
asynchronous instructions since such a set does not 
form a proper tree structure. Instead a simple one 
level tree with one root must be formed. In essence 
this corresponds to the case where a number of instruc- 
tions can be executed simultaneously and the execution 
of a succeeding instruction must await their completion.
The root instruction in this group then acts as a
semaphore since it prevents the execution of
instructions higher on the same branch until it is
cleared. Such a group of instructions is represented
in the form


    inst-1;
        inst-2,
        inst-3,
        inst-4


In this group, the instructions inst-2, inst-3 and 
inst-4 can be executed asynchronously (for our purposes 
here) but inst-1 cannot be executed until all of those 
instructions have run to completion. 
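
The notation and its leaf-first execution order can be checked with a small sketch. It assumes that indentation is written with spaces, that instruction names are unique within a group, and that every leaf present at one moment is executed in the same round; the function names are illustrative only and are not part of VDL.

```python
def parse_group(text):
    """Map each instruction to its parent: the nearest preceding line with
    smaller indentation (None for the root). Punctuation is stripped."""
    parents, open_levels = {}, []          # open_levels: stack of (indent, name)
    for line in text.splitlines():
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip())
        name = line.strip().rstrip(";,")
        while open_levels and open_levels[-1][0] >= indent:
            open_levels.pop()
        parents[name] = open_levels[-1][1] if open_levels else None
        open_levels.append((indent, name))
    return parents

def execution_rounds(parents):
    """Repeatedly execute (and remove) every instruction that has no child."""
    remaining, rounds = dict(parents), []
    while remaining:
        has_child = set(remaining.values())
        ready = [name for name in remaining if name not in has_child]
        rounds.append(ready)
        for name in ready:
            del remaining[name]
    return rounds

SEQUENTIAL = """\
inst-1;
    inst-2;
        inst-3;
            inst-4
"""

ONE_LEVEL = """\
inst-1;
    inst-2,
    inst-3,
    inst-4
"""

sequential_rounds = execution_rounds(parse_group(SEQUENTIAL))
one_level_rounds = execution_rounds(parse_group(ONE_LEVEL))
```

The diagonal group yields one instruction per round, from inst-4 back to inst-1, whereas the one level group releases inst-2, inst-3 and inst-4 together and holds inst-1 until they are gone.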


State-modifying definition groups specify the changes 
to be made to the state of the machine (with the 
exception of the control stack). Each group 
corresponds closely to a mutation operation, the 
subject argument of the mutation being the state of the 
machine. Thus the definition group is a listing of the 
selector:value pairs, the selectors being applicable to 
the state of the machine and the values being functions 
over the parameters (replaced by the argument values) 
of the instruction (if any) and components of the state 
of the machine. The general form of a state-modifying 
group is 


    s-sc_1 : exp_1
      ...
    s-sc_m : exp_m


where the s-sc_i are selector functions and the exp_i are
evaluated to the values which are to be placed in 
the state. By the judicious choice of selector names, 
the data paths in the processor can easily be simulated. 
For example, let us assume that the memory address 
register is represented as the s-mar component of the 
state (ξ) and that the program counter is represented
as the s-pc component. Then the operation of transferring
the contents of the program counter to the memory
address register can be represented by the definitional
instruction pc-to-mar and be defined simply by


    pc-to-mar =
        s-mar : s-pc(ξ)


which states:
"Replace the contents of the s-mar component of the
state by the contents of the s-pc component of the
state."


Since we are dealing with a finite state abstract 
machine, the question of timing between the acquisition 
of the data elements of an operation and the placement 
of the result in the state is overcome by the simple 
ruse that the new state is a copy of the old state. 
Thus the execution of an elementary shift command (over 
a three bit register) is well defined: 


    shift =
        bit-0*s-acc : bit-1*s-acc(ξ)
        bit-1*s-acc : bit-2*s-acc(ξ)
        bit-2*s-acc : bit-0*s-acc(ξ)


In this definition, the selector functions are
composite, the accumulator being represented by the
s-acc component of the state and the individual bits
within the accumulator being selected by the functions
of the form bit-i. That is, the functional composition
operator (*) can be read as "of". Since it will be
necessary to reference elements of state components in
a generalized form, we shall permit the extension of
the explicit naming of selector functions to include a
functional notation in which the index of selection is
included as an argument. For example, if the memory
component of the prototype is represented as the s-mem
component of the state of the abstract machine, and
the memory is divided into pages, each page containing
a number (presumably fixed) of words, then a reference
to a single word will require three functional applications
to select the word from the state. To accomplish
this will require the provision of two
arguments: the word address (or index) and the page
address. Thus it would be possible to develop a word
reference mechanism in the form of a composite
selector function


    s-word(word-address)*s-page(page-address)*s-mem


Thus the definition of the store operation might be


    store =
        s-word(s-wa*s-mar(ξ))*s-page(s-pa*s-mar(ξ))*s-mem :
            s-mbr(ξ)


where s-wa selects the word address from the memory
address register, and correspondingly the s-pa function
selects the page address, and s-mbr(ξ) represents the
memory buffer register into which (by some previous
step) the value which is to be stored has been placed.
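
Both the copy-of-old-state rule used by shift and the composite selection used by store can be imitated with nested dictionaries. The sketch below is an assumption-laden illustration: the state layout, the two-page memory and the helper name mutate are invented here, and each selector:value pair is written as a pair of functions of the old state.

```python
import copy

def mutate(state, pairs):
    """Apply selector:value pairs. Every selector path and every value is
    computed from the OLD state; only then is a copy of the state updated."""
    resolved = [(path_fn(state), value_fn(state)) for path_fn, value_fn in pairs]
    new_state = copy.deepcopy(state)
    for keys, value in resolved:
        target = new_state
        for key in keys[:-1]:
            target = target[key]
        target[keys[-1]] = value
    return new_state

# shift over a three bit accumulator:
# bit-0 <- bit-1, bit-1 <- bit-2, bit-2 <- bit-0
SHIFT = [
    (lambda s: ("s-acc", 0), lambda s: s["s-acc"][1]),
    (lambda s: ("s-acc", 1), lambda s: s["s-acc"][2]),
    (lambda s: ("s-acc", 2), lambda s: s["s-acc"][0]),
]

# store: the word selected by the page and word address in s-mar <- s-mbr
STORE = [
    (lambda s: ("s-mem", s["s-mar"]["s-pa"], s["s-mar"]["s-wa"]),
     lambda s: s["s-mbr"]),
]

state = {
    "s-acc": {0: 1, 1: 0, 2: 0},
    "s-mar": {"s-pa": 1, "s-wa": 0},
    "s-mbr": 42,
    "s-mem": {0: {0: 0, 1: 0}, 1: {0: 0, 1: 0}},   # two pages of two words
}

shifted = mutate(state, SHIFT)
stored = mutate(state, STORE)
```

Because all three right hand sides of shift are read before any bit is written, the rotation is well defined regardless of the order in which the pairs are listed.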


This complexity of structure is defined in terms of 
predicates which describe the abstract syntax (i.e., 
structure) of the state of the machine. In part, for 
this mythical machine which we have been considering, 
the state can be defined by the predicates: 


    is-ξ = (<s-mem:is-memory>,
            <s-mbr:is-word>,
            <s-acc:(<s-link:is-bit>,
                    <s-body:is-word>)>,
            <s-mar:(<s-wa:is-word-address>,
                    <s-pa:is-page-address>)>,
            ...)


where each of the pairs in the structured predicate
specifies the name of the branch on which the component
is located (in the tree descriptive sense) and the
structure of the component. Each of these descriptions
must eventually be defined in terms of the elementary
objects in the system, so that, for example, the
s-link*s-acc component of the state is defined to be in
conformance with the predicate is-bit, which defines a
set of elementary objects. On the other hand, the
memory buffer register (s-mbr component) is defined to
be of the form is-word which we will define by the
structure


    is-word = ({<bit(i):is-bit> | 0 ≤ i ≤ 11})


That is, the structure is composed of a set of pairs,
the object of each of which is a bit (defined by is-bit)
and the selector of which is of the form bit(i)
where the value of the index i is in the range 0 to 11.
Effectively this defines a 12 bit word.
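
A predicate of this kind is simply a structural check. As a minimal sketch, assuming that an object is modelled as a Python dictionary whose keys stand for the selectors bit(i), the two predicates might be written as follows; the encoding of a selector as the tuple ("bit", i) is an assumption of the sketch.

```python
WORD_LENGTH = 12   # the index i runs from 0 to 11

def is_bit(obj):
    """Elementary objects satisfying is-bit: the values 0 and 1."""
    return obj in (0, 1)

def is_word(obj):
    """True when obj has exactly the selectors bit(0)..bit(11) and every
    selected object satisfies is-bit."""
    expected = {("bit", i) for i in range(WORD_LENGTH)}
    return (isinstance(obj, dict)
            and set(obj) == expected
            and all(is_bit(value) for value in obj.values()))

zero_word = {("bit", i): 0 for i in range(WORD_LENGTH)}
short_word = {("bit", i): 0 for i in range(WORD_LENGTH - 1)}
bad_word = {**zero_word, ("bit", 3): 2}
```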


THE BLUE MACHINE 


For the purposes of discussion here, let us examine 
the structure and description of a simple processor. 
The machine chosen is that described by Foster (Fo1)
since his description (from a pedagogical point of 
view) fits our purposes well. 


BLUE is a binary, two's complement, stored program,
fixed word length, parallel, digital computer with
4096 words of 1 μsec co-ordinate addressed core
storage of 16 bits per word. Each word may contain
either a 15 bit integer numeric representation plus
sign, or a 16 bit instruction composed of a 4 bit
operation code and a 12 bit address. No index
registers, no indirect addressing and no interrupt 
facilities are included, though as may be seen from 
the descriptions, it would not be conceptually 
difficult to add these features. The general picture 
of BLUE is shown in Figure 2 and the corresponding 
representation of the components in the state of the 
abstract defining machine is shown in Figure 3. For 
the purposes of our discussion here we shall assume 
that the external operations of loading the program 
counter and starting the operation of BLUE by the 
pressing of the appropriate buttons result in the 
deposition of the low order contents of the switch 
register into the program counter and the setting of 
the run flip-flop to RUN (represented by 1) respec- 
tively. No specific descriptions of these actions will 
be included since these are manual rather than 
automatic operations. Foster describes the basic 
cycles of the BLUE machine as being composed of two 
parts; the FETCH and the EXECUTE cycles. It is assumed 
that the STATE flip-flop which defines which cycle is 
to be entered next, will be set to F initially, there- 
by assuring the correct sequence of operations. The 
actions of the FETCH cycle are described in Table 1 
(from Fo1).


TABLE 1
The Fetch Cycle Elements


Clock
Pulse   Action

  1     initiate read-restore       (read time: pulses 1-4)
  2     +1 → PC
  3     clear MBR
  4     clear IR
  5     (MBR) → IR; begin decode    (restore time: pulses 5-8)
 6-8    may change contents of MAR


The last three pulse times in this sequence are 
available for the execution of the various non-memory 
referencing instructions such as HLT (halt), JMP (jump) 
or CSA (console switches to accumulator), or for the 
set up operations necessary for the execution of two 
cycle instructions. 


Close examination of the description of the first part 
of the FETCH cycle (which is common to all BLUE in- 
structions) shows that there are at least two opera- 
tions occurring simultaneously during pulse times 2 
through 4; that is, the action of fetching the instruc- 
tion from memory (at a location determined by the 
contents of program counter) initiated at pulse time 1 
is operational through pulse time 4, at which time the 
contents of the memory location are available in the 
memory buffer register. Whilst this action is 
continuing, the other actions of incrementing the
program counter (time 2), and clearing the MBR and IR 
are executed in parallel. During times 5 through 8, 
the memory is being restored and thus additional 
parallel operations are proceeding during these pulse 
times. This verbal and tabular description can be 
converted into a VDL instructional system which is 
equally expressive: 


    fetch =
        part-2;
            register-set,
            initiate-read


where


    register-set =
        clear-ir;
            clear-mbr;
                inc-pc;
                    no-op


and


    initiate-read =
        mem-to-mbr;
            no-op;
                no-op;
                    no-op


FIGURE 2. THE BLUE MACHINE


FIGURE 3. THE VDL OBJECT REPRESENTING THE BLUE MACHINE
(state ξ with components including s-run, s-state, s-pc and s-sw)


where the instructions no-op are used to show the
relative timing of the two sets of instructions.
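
The timing implied by these definitions can be traced mechanically. The sketch below assumes that the macro-expansions of fetch, register-set and initiate-read have already been carried out, that every leaf of the control tree executes in the same pulse, and that a tree node is modelled as a (name, children) tuple; the helper names are illustrative only.

```python
def chain(*names):
    """Diagonal (sequential) group: each name is the parent of the next."""
    node = None
    for name in reversed(names):
        node = (name, [node] if node is not None else [])
    return node

def prune(node, ready):
    """Return the node without its current leaves, collecting their names."""
    name, children = node
    if not children:
        ready.append(name)
        return None
    kept = [c for c in (prune(child, ready) for child in children)
            if c is not None]
    return (name, kept)

def pulses(tree):
    """List, pulse by pulse, the leaf instructions executed together."""
    schedule = []
    while tree is not None:
        ready = []
        tree = prune(tree, ready)
        schedule.append(ready)
    return schedule

register_set = chain("clear-ir", "clear-mbr", "inc-pc", "no-op")
initiate_read = chain("mem-to-mbr", "no-op", "no-op", "no-op")
fetch = ("part-2", [register_set, initiate_read])

schedule = pulses(fetch)
```

Under these assumptions the first four entries of schedule pair the register operations with the progress of the memory read, and part-2 becomes a leaf only in the fifth entry, after both branches have been cleared.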

In this case it is not clear from the description 
of the fetch cycle what actually occurs in the 
initiate read operation in BLUE during pulse times 
1 through 3, though it is clear that in pulse time 
4 the contents of the selected location are placed 
in the MBR. This operation can be defined by the 
instruction 


    mem-to-mbr =
        s-mbr : word(s-mar(ξ))*s-mem(ξ)


Similarly, the instruction to increment the 
program counter may be defined by the group 


    inc-pc =
        s-pc : s-pc(ξ) + 1


Immediately we must question whether this defini- 
tion is sufficient. From the point of view of the 
programmer, this definition clearly states the 
action which BLUE is to take; however, from the 
point of view of the designer (or someone else 
interested in more details) this definition might 
better be expressed in the form 


    inc-pc =
        s-pc : add(s-pc(ξ), 1)


where the function add is to be defined further.
For a programmer this depth of definition may well
be sufficient, but a function can be considered to be
equivalent to a logical circuit which could be
defined by a logical expression. In any case this
definition can be translated as being representative
of the circuit shown in Figure 4.

FIGURE 4. THE PC INCREMENT SYSTEM


Once the two instructions which precede part-2 in
the definition of fetch have been cleared off the
control stack, then the second portion of the
fetch cycle can be initiated. As in the first
cycle, this contains two parallel actions: the
decoding of the instruction and the restoration of
the memory. Thus part-2 can be described by the
group


    part-2 =
        next-state;
            decode,
            restore


where the decode instruction is expanded into the 
sequence 


    sieve;
        mbr-to-ir


and where sieve is the instruction which replaces 
itself by the sequence of operations which result 
in the execution of the BLUE instruction. This 
instruction can be defined by the conditional 
expression 


    sieve =
        oct(s-op*s-ir(ξ)) = 0  → execute-hlt
        oct(s-op*s-ir(ξ)) = 1  → execute-add
          ...
        oct(s-op*s-ir(ξ)) = 17 → execute-nop


where the selector function (defined in the 
abstract syntax of BLUE) s-op selects the 
operation code portion of the instruction from 
the instruction register (the s-ir component of 
the state €). This portion of the instruction is 
represented by a tree and therefore true equality 
can only be attained if the comparand is also a
tree. However, we have chosen to overcome this, 
at this level of definition by the use of the 
function oct which we define to develop the octal 
equivalent of the tree representation. This 
object can then be compared with the octal 
operation codes. To be more precise at a lower 
level of definition it would be necessary to 
describe this sieving operation by logical 
expressions of the form 


    bit(15)*s-ir(ξ) = 0 &
    bit(14)*s-ir(ξ) = ... &
    bit(13)*s-ir(ξ) = ... &
    bit(12)*s-ir(ξ) = ... → execute-jmp


which more precisely mirrors the structure of the 
binary decoding tree for BLUE. 
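
The relation between the two levels of definition can be made concrete. The following sketch assumes that the instruction register is held as a 16 bit integer with bit 15 the most significant, and it lists only the three operation codes that are spelled out in the definition of sieve; the remaining codes are elided there and are therefore not guessed here.

```python
# Only the codes shown explicitly in the definition of sieve are included.
SIEVE = {0o00: "execute-hlt", 0o01: "execute-add", 0o17: "execute-nop"}

def op_bits(ir):
    """Bits 15..12 of the instruction register, most significant first."""
    return [(ir >> i) & 1 for i in (15, 14, 13, 12)]

def oct_op(ir):
    """Numeric value of the operation code, comparable with octal codes."""
    value = 0
    for bit in op_bits(ir):
        value = value * 2 + bit
    return value

def sieve(ir):
    """Select the definitional instruction which executes the BLUE command."""
    code = oct_op(ir)
    if code not in SIEVE:
        raise KeyError(f"operation code {code:o} is not listed in this sketch")
    return SIEVE[code]
```

Here oct_op plays the role of the function oct at the higher level, while op_bits exposes the individual bits on which a lower level definition would state its logical expressions.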


At the end of the fetch cycle the STATE flip-flop 
is set to indicate which of the possible two 
states is to be entered next; E indicates the 
execute cycle, F indicates the fetch cycle. Thus 
the instruction next-state which is the final 
instruction in part-2 is the switch which deter- 
mines where the processing should continue. An 
alternative means of specifying the sequence of 
steps in the fetch cycle which are directly 
related to the pulse times would be to define the 
fetch instruction as a sequence of instructions 
each of which is related to the pulse time and 
which then leaves the next pulse time operation as 
the next instruction to be executed. That is, 

1.  fetch =
        pulse-time-1

2.  pulse-time-1 =
        pulse-time-2;
            initiate-read

3.  pulse-time-2 =
        pulse-time-3;
            inc-pc

4.  pulse-time-3 =
        pulse-time-4;
            clear-mbr

5.  pulse-time-4 =
        pulse-time-5;
            clear-ir,
            mem-mbr

and so on.
This scheme would have the advantage (from the
point of view of the reader) that the actions are 
directly related to the pulse times and the dummy 
no-op instructions are obviated. 


SUMMARY 


The description and design of BLUE was sufficient 
to indicate the ability of the VDL techniques for 
describing the operations of a processor. However,
this was an exercise in the description of an 
already existing machine and thus no untoward 
problems came to light. If VDL were to be used as 
a design tool then some directions are necessary 
to derive an implementation from a description. 
Obviously some simple comparisons can be drawn 
between instructions and the structure of the 
machine; that is, for example, state-modifying 
instructions represent data paths between elements 
of the machine. Macro-expansion definitions can 
be interpreted in one of two manners; either an 
expansion is the passage from one level of 
description to another, as in the description of 
the inc-pc instruction, or it represents the
sequencing of operations which depends on the current state of
the machine, as in the case of the instruction
decode. The precise manner of discriminating
between these two uses is not entirely clear at 
this time and requires further investigation. 


In the version of VDL which is most general, and 
which has been used for the description of 
programming languages, the instructions are 
accompanied by a set of arguments which are passed 
through the control stack. Such arguments, in a 
processor, require some medium of transmission and 
can be construed to be indicative of the need for 
a register within the prototype. That is, if a 
definitional instruction cannot be expressed with- 
out the use of additional data which is passed 
through the argument list, then an additional 
register is required in the prototype together 
with the appropriate data paths. 


This presentation has shown the many levels of 
description which can be served by a single 
unified definitional schema and has emphasized 
earlier that the schema is a continuum from the 
instruction level of definition to the abstract 
machine which underlies the system. Work is 
already in progress to develop the properties of 
the definitional system (see Le1, ch. 2) and to
develop means for the validation of definitions. 


Finally, it must be recognized that not only has 
VDL the power to be a definitional system for the 
description of processors, but also is capable of 
providing a common base for the definition of 
other descriptive techniques. This capability may 
well provide the means by which the equivalence of 
descriptive elements of other languages can be 
proved, and further will not require the abandon- 
ment of other descriptive techniques merely to 
satisfy the ambition of a unified approach to 
processor description. 
Abstract


A methodology is developed for determining how 
much parallelism is optimal if a given job stream is to 
be executed without multiprogramming. Qualitative de- 
sign tradeoffs are inferred from the cost-performance 
effect of parallelism on different hardware subsystems. 
Measures of software parallelism are analytically re- 
lated to measures of hardware performance. It is shown 
that an increase in hardware parallelism may be desir- 
able even though it causes an increase in job process- 
ing cost and/or a decrease in hardware efficiency. 


INTRODUCTION 


There have been numerous papers written about the 
impact of LSI on computer architecture. Many authors 
have pointed out that the technology of the inexpensive 
computer-on-a-chip will make systems with a high degree 
of parallel processing and multiprocessing economically 
feasible (5,6,7,12). Kuck has proposed that, by decomposing
a program into its concurrently executable parts,
these highly parallel systems will be economically
viable even when used to execute one program at a time
(monoprogramming) (7). On the other hand, Chen has
demonstrated that highly parallel systems are doomed to
be very inefficient, and he has suggested that multiprogramming
is mandatory if such systems are to be practical
(3). This apparent disagreement stimulated the
analysis made in this paper. We do not claim to have 
resolved this conflict in favor of one or the other of 
these authors. In fact, our inquiry is limited to an 
analysis of monoprogramming applications. However, we 
do feel that we have developed a methodology for deter- 
mining how much parallelism (if any) is optimal if a 
given job stream is to be executed without use of 
multiprogramming. 


In this methodology we will emphasize the consid- 
eration of what the user is willing to pay for a parti- 
cular computational service. In Section I, we explain 
how we think this consideration can be applied in the 
design process. In Section II, we derive certain 
qualitative design guidelines that can be inferred from 
this consideration. These guidelines may be obvious to 
the experienced designer, but we feel that it is signi- 
ficant that they can all be inferred from this one 
consideration. 


The performance of a parallel hardware system will 
be considerably influenced by the parallelism inherent 
in the software. In Section III, we present some 
possible measures of software parallelism and derive 
expressions relating these measures to measures of 
hardware performance. In Section IV, we show how one
of these expressions can be used in determining the
optimal degree of hardware parallelism.


Section I 


Many different measures of computer performance 
have been suggested and used. The most common measures 
used for general purpose computer applications are 
throughput rate, response time and equipment utiliza- 
tion. Systems designed for less than general purpose 
use may be evaluated against other measures such as the 
mean time for high priority jobs to get processed or 
the mean job starting delay (9). In comparing dif- 
ferent hardware equipment, the price of the unit can be 
included in defining the performance measure, resulting 
in measures such as price per instruction ratio and 
price per register ratio (12). Other authors have 
suggested that the performance measure should not only 
include cost, but must also include a measure of the 
effectiveness with which the system provides service to 
the user (10). We feel that, for the general user, a 
good measure of the quality of service provided is the 
time required to process his job. Thus, a computer 
performance measure should include the cost of process- 
ing the job and the time required to do that processing. 
This is not a new idea. Lehman used these two factors 
when he suggested that the performance of multiprocess- 
ing systems be compared by computing the product of the 
cost of processing and the job throughput time (8). 
(Using this measure, the best system would of course 
have the least product.) 
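
As a small illustration of the measure attributed to Lehman, the sketch below ranks three systems by the product of cost and time. The system names and all of the figures are invented for the example.

```python
# (cost of processing, job throughput time) in arbitrary, hypothetical units
SYSTEMS = {"X": (10, 4), "Y": (6, 8), "Z": (9, 5)}

def cost_time_product(cost, time):
    """The product measure: smaller is better."""
    return cost * time

products = {name: cost_time_product(*pair) for name, pair in SYSTEMS.items()}
best = min(products, key=products.get)
```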


One can, however, argue that Lehman's choice of 
the product of these two factors is arbitrary; there is 
no a priori reason for selecting the product over any 
other functional relation between these two quantities. 
In fact, we claim that a system designer should not 
work with a simple functional relationship of this sort. 
The following discussion explains why this is so. 


Consider a user who has a particular computational 
job that he wants done. Assuming that he has some 
experience in running his job on various systems, he 
will have a pretty good idea of what he is willing to 
pay to get the job done. Also, what he is willing to 
pay will depend somewhat on how long he must wait for 
his results. From time to time, the job turnaround 
time that he requires may vary, and as it varies, what 
he is willing to pay may also vary. Figure 1 illus- 
trates the general way in which the user will relate 
these two factors. This figure is not meant to be 
drawn against any scale; it is just meant to illus- 
trate that this curve will have three distinct regions. 
In Region I, the user is telling us that a further decrease
in his job's processing time is of no value to
him, and he will not pay more for this better service.
In Region II, he is willing to trade cost for "service"
in some manner. In Region III, the processing time is
so long that the service is of no practical value to
this user.


In Figure 2, points A, B, C, D, E and F represent 
hypothetical hardware executions of our user's job. 
They each represent a different system because the 
same system would always run the job for the same cost 
with the same processing time. (For simplicity we are 
not considering systems where interactions with other 
jobs may influence our job's processing time.) Points 
A and F represent hardware solutions which are unac- 
ceptable to our user. Points B and E are acceptable 
points, and it is important to note that they are 
equally acceptable to the user; he does not prefer one 
of these over the other even though their respective 
costs and processing times may be markedly different. 
Points C and D are both preferable to points B and E.
(E.g., since the user is willing to pay "B's" price, he
finds "C's" lower price for the same service time preferable.)
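
The classification of the six points can be reproduced with a toy tradeoff curve. Since the figures are not drawn to any scale, every number below (the region boundaries, the prices and the coordinates of points A through F) is an assumption chosen only so that the points fall where the text places them.

```python
# Hypothetical curve: flat in Region I, falling in Region II, zero in Region III
T_FLAT, T_USELESS, TOP_PRICE, SLOPE = 2, 10, 100, 10

def willing_to_pay(time):
    """What the user will pay for a given job processing time."""
    if time <= T_FLAT:                               # Region I
        return TOP_PRICE
    if time < T_USELESS:                             # Region II
        return TOP_PRICE - SLOPE * (time - T_FLAT)
    return 0                                         # Region III

def classify(time, cost):
    """Compare an operating point with the tradeoff curve."""
    limit = willing_to_pay(time)
    if cost > limit:
        return "unacceptable"
    if cost == limit:
        return "acceptable"
    return "preferable"

# (job processing time, job cost) for each hypothetical execution
POINTS = {"A": (1, 120), "B": (1, 100), "C": (1, 60),
          "D": (5, 40), "E": (6, 60), "F": (12, 10)}

verdicts = {name: classify(*point) for name, point in POINTS.items()}
products = {name: time * cost for name, (time, cost) in POINTS.items()}
```

With these invented numbers, B and E lie on the curve and are equally acceptable to the user, yet their cost-time products differ by more than a factor of three, which illustrates why a fixed functional measure need not agree with the user's own tradeoff.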


FIGURES 1 and 2 (JOB COST plotted against JOB PROCESSING TIME)


Having determined this cost-service tradeoff
curve for our particular user, we are unable to say
whether or not he would prefer system C to system D.
One might suggest that we interrogate our user further
concerning his preferences in the region of the graph
below the tradeoff curve. Since we intend that our
hypothetical user be representative of a potential market
of users, this interrogation would really amount to
an extensive market survey. Furthermore, points C and
D cannot represent existing systems. Our knowledgeable
user would naturally have drawn the curve of what he
was willing to pay in such a way that all existing
systems would lie either on or above it. Points C and
D can, however, represent designed systems which have
not yet been marketed. But the question of producing
system C or D is basically a marketing decision.


The job of the system designer is to produce a design
such as system C or D, either of which is clearly
better than all existing systems. Thus, the designer
needs this curve and a methodology for producing designs
which will have "operating points" below it. A
valid performance measure could provide this tradeoff
curve, but clearly such a measure would not be a simple
functional relationship that one could postulate
a priori.


So far, we have limited our discussion to the case 
of one hypothetical user with one job. If we were de- 
signing a system for only one user, a design with a 
projected operating characteristic such as C or D would 
clearly be a viable project. But a truly viable pro- 
duct would have to provide satisfactory service to many 
users, and for each user (indeed, for each different 
job!) there will be a different tradeoff curve. 


In order to limit this multiplicity of tradeoff 
curves, we propose that both the quantities cost-per- 
job and processing-time-per-job be normalized by di- 
viding them by a measure of the total "work" required 
by the job. (The quantification of "work" which we 
propose is discussed in Section III). For instance, if 
a user has a job that basically consists of two identi- 
cal subjobs, he will expect the job to cost twice as 
much and require twice as much time to execute as would 
one of the subjobs. Since the job contains twice as 
much work as the subjob, the normalized curves for the 
job and the subjob will coincide. Furthermore, since 
the curve is essentially determined by the prevailing 
market of available computational service, this nor- 
malized curve, for a particular type of computation, 
should not vary appreciably from user to user. Thus, 
some type of normalization of these curves is required, 
and we think that this normalization factor is a 
reasonable one. 
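
The effect of the proposed normalization on the two-subjob example can be verified directly. The figures below are hypothetical, and "work" is treated simply as a number, since its quantification is deferred to Section III.

```python
def normalize(cost, time, work):
    """Cost-per-job and processing-time-per-job, each divided by the work."""
    return cost / work, time / work

# A subjob, and a job consisting of two identical subjobs (hypothetical units)
subjob_point = normalize(cost=3.0, time=5.0, work=1.0)
job_point = normalize(cost=6.0, time=10.0, work=2.0)
```

The two normalized operating points coincide, which is the sense in which the normalized curves for the job and the subjob are the same.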


Consequently, a curve of this nature can be obtained
and it can be of great aid to the designer.
For instance, if an existing system has an operating
point such as point B in Figure 2 (i.e., it is in Region I
of Figure 1), the design of that system can be
improved only by a change that will reduce the job
processing cost. On the other hand, if one is trying
to improve system "E", reducing job processing time is
as important as reducing job cost. In the next section
we will show how this curve can be used in determining
the relative merit of different hardware changes that
might be made to an existing design.


Section II 


We will now briefly develop some qualitative hard- 
ware design guidelines that can be inferred from the 
general shape of the user's tradeoff curve discussed 
in Section I. In this development we will employ the 
terminology and notation suggested by Bell in cate- 
gorizing hardware functional modules as data operators 
(D), controllers (K), etc. (2). We will consider three
types of changes that could be made in a design: (1)
change of technology used, (2) change of amount of
parallelism in K and (3) change of parallelism in D.


The shape of the curve in Figures 1 and 2 tells 
us that any design change that both reduces the cost 
and reduces the processing time will be a good one. 
(Of course, intuition or common sense could have told 
us that!) However, if we are not able to reduce both 
these factors simultaneously, we may still be able to 
improve the design. If we know the present design results
in an operating point in Region I of Figure 1, a
design change which reduces the cost of processing will 
be good even if it increases the processing time. In 
Region II, a change which reduces processing time while 
increasing the cost may be desirable. 


The use of faster, more expensive technology will
reduce processing time and may or may not reduce pro- 
cessing cost. Thus, it will in general be a valid 
design change in Region II, but not in Region I. (In 
fact, in Region I, the use of slower, less expensive 
technology will be desirable if it will reduce the 
cost of processing.) 


The use of parallel K (e.g. multiprocessing) will 
reduce the processing time but will usually not reduce 
the cost of processing. Thus, increasing the paral- 
lelism of K may be a good design change in Region II, 
while decreasing it may be called for in Region I. 


Increasing the parallelism of D while keeping K 
non-parallel will, up to a point, decrease the cost of 
processing. The simple example of a parallel adder 
explains why this is so. As long as the width of the 
adder can be effectively utilized, doubling the width 
will halve the add time. But, doubling the width will 
not double the hardware cost since the cost of the con- 
troller will not change. Thus, the processing cost 
will decrease. At some point, however, due to ineffec- 
tive use of the increased width, the cost increase will 
not be offset by the decrease in average add time, and 
the processing cost will increase. Consequently, with 
respect to processing cost, there is some optimal de- 
gree of parallelism in D. This will also be the opti- 
mal degree of D parallelism for a design in Region I 
of Figure 1. However, in Region II, more parallelism 
than this "optimal" amount may be desirable. 
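The adder argument can be illustrated numerically. The following is a minimal sketch and not part of the original analysis: the controller cost `K_COST`, the per-bit adder cost `D_COST_PER_BIT` and the mix of operand lengths `OPERAND_BITS` are hypothetical values chosen only to show the shape of the effect. Processing cost is taken as hardware cost per unit time multiplied by the average add time.

```python
import math

K_COST = 16.0                   # hypothetical controller (K) cost per unit time
D_COST_PER_BIT = 1.0            # hypothetical adder (D) cost per bit of width
OPERAND_BITS = [8, 16, 16, 32]  # hypothetical mix of operand lengths

def mean_add_time(width):
    """Average number of passes through a `width`-bit adder per addition."""
    passes = [math.ceil(bits / width) for bits in OPERAND_BITS]
    return sum(passes) / len(passes)

def processing_cost(width):
    """Cost per addition = (hardware cost per unit time) x (time per addition)."""
    return (K_COST + D_COST_PER_BIT * width) * mean_add_time(width)

widths = [1, 2, 4, 8, 16, 32, 64]
costs = {w: processing_cost(w) for w in widths}
best_width = min(costs, key=costs.get)

for w in widths:
    print(f"width={w:2d}  time={mean_add_time(w):5.2f}  cost={costs[w]:7.2f}")
print("cost-optimal width:", best_width)
```

With these assumed numbers the cost falls while the extra width is used effectively and rises again once most operands no longer fill the adder, so a cost-optimal width exists; a Region II design might still prefer a wider adder for its shorter add time.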


We note that Bell's entire approach of dividing a 
system into components M, L, K, D, and so on should be 
useful in analyzing costs. Just as we have evaluated 
the cost of serial/parallel adders one could evaluate 
larger systems by the effect of parallelism on each 
division of the system. The analysis in this section 
has been qualitative. In the remainder of this paper, 
we will develop some quantitative relationships which, 
when used with the user tradeoff curve, can be helpful 
in determining the optimal degree of hardware paral- 
lelism. 


Section III 


The optimal degree of hardware parallelism will, 
of course, be dependent on the parallelism of the job 
for which it is designed. In discussing job parallel-
ism, we will employ the job "space-time" diagram sug-
gested by Chen (3). Figure 3 illustrates the space-
time diagram of a hypothetical job. The "widths" $W_i$
represent the relative parallelism of the job during
the time interval $t_i$. A machine with no parallelism
would require $\sum_i W_i t_i$ time units to process a given
job. Thus we define the total work associated with
a job to be $\sum_i W_i t_i$. This is the factor which we will
use to normalize our job-cost versus job-processing-
time graph. (Thus, the normalized cost of a job will
be cost / $\sum_i W_i t_i$.)


FIGURE 3. Space-time diagram of a hypothetical job (vertical axis: job "width").


An intuitively appealing way to quantify job
parallelism would be: parallelism = time-average job
"width" or


$$\rho_1 = \sum_i W_i \left( t_i \Big/ \sum_j t_j \right) = \frac{\sum_i W_i t_i}{\sum_i t_i}$$


Using this measure, the minimum possible parallelism
is one, and there is no maximum. This definition of
parallelism can be modified so that it takes on values
between zero and one by defining


$$\rho_2 = \frac{\sum_i W_i t_i}{\max_i (W_i) \sum_i t_i}$$


In a machine having parallelism equal to $\max_i (W_i)$, this
definition would correspond to


$$\rho_2 = \frac{\text{space-time used}}{\text{space-time available}}$$


Chen has suggested that job parallelism be defined as,


$$\rho_3 = \frac{\text{amount of space-time showing parallelism}}{\text{total space-time of job}}$$


Letting $t_s$ be the total time that the job has no paral-
lelism, we have


$$\rho_3 = \frac{\sum_{i \neq s} W_i t_i}{\sum_i W_i t_i}$$


Chen has also defined machine efficiency ($\eta$) to be


$$\eta = \frac{\text{total space-time of job}}{\text{total space-time swept by hardware}}$$
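The three job-parallelism measures can be computed directly from a space-time diagram. The sketch below is illustrative only: the function name `parallelism_measures` and the example job are hypothetical, and a segment with width 1 is taken to be the non-parallel part of the job.

```python
def parallelism_measures(segments):
    """Return (rho1, rho2, rho3) for a job given as (W_i, t_i) pairs.

    A pair with W_i == 1 is treated as time during which the job
    shows no parallelism.
    """
    work = sum(w * t for w, t in segments)      # sum_i W_i t_i
    elapsed = sum(t for _, t in segments)       # sum_i t_i
    w_max = max(w for w, _ in segments)         # max_i W_i
    parallel_work = sum(w * t for w, t in segments if w > 1)

    rho1 = work / elapsed                       # time-average job width
    rho2 = work / (w_max * elapsed)             # space-time used / available
    rho3 = parallel_work / work                 # Chen's measure
    return rho1, rho2, rho3

# Hypothetical job: (width, duration) for each interval.
job = [(1, 2.0), (4, 1.0), (2, 3.0), (1, 1.0)]
rho1, rho2, rho3 = parallelism_measures(job)
print(f"rho1={rho1:.3f}  rho2={rho2:.3f}  rho3={rho3:.3f}")
```

For this assumed job the total work is 13 units over 7 time units, so the time-average width is 13/7, the normalized measure is 13/28, and 10 of the 13 work units show parallelism.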
As we increase the parallelism "width" (N) of a
machine, the normalized processing time for a job, T,
will decrease until $N = \max_i (W_i)$. For $N \geq \max_i (W_i)$,
T will have a minimum value,


$$T_{min} = \frac{\sum_i t_i}{\sum_i W_i t_i}$$


Thus, we see,


$$T_{min} = 1 / \rho_1$$


For $T = T_{min}$, the maximum efficiency occurs when
$N = \max_i (W_i)$. If we call this maximum efficiency $\eta_0$,
we have


$$\eta_0 = \frac{\sum_i W_i t_i}{\max_i (W_i) \sum_i t_i}$$


or


$$\eta_0 = \rho_2$$


Thus if $T = T_{min} = 1/\rho_1$,


$$\eta \leq \rho_2$$


Consequently, for highly parallel hardware systems
(i.e. where $N \geq \max_i (W_i)$), the software measures $\rho_1$
and $\rho_2$ can yield quantitative information about the
performance of the system.


The following analysis shows that, if
$N \ll \max_i (W_i)$, other quantitative relationships can
be derived. If a machine has a parallelism "width" of
N, it will require the integer part of $\left( \frac{W_i - 1}{N} + 1 \right)$
"passes" to process the ith parallel section of the
job. If we approximate this number of passes to be
equal to $W_i / N$, our normalized total processing time is,


$$T = \frac{t_s + \frac{1}{N} \sum_{i \neq s} W_i t_i}{\sum_i W_i t_i}$$


We note that,


$$\frac{1}{N} \cdot \frac{\sum_{i \neq s} W_i t_i}{\sum_i W_i t_i} = \frac{\rho_3}{N}$$


Also, since the non-parallel part of the job has unit width,


$$(1 - \rho_3) = \frac{\sum_i W_i t_i - \sum_{i \neq s} W_i t_i}{\sum_i W_i t_i} = \frac{t_s}{\sum_i W_i t_i}$$


Thus


$$T = (1 - \rho_3) + \rho_3 / N$$


We also note that


$$\eta = \frac{\sum_i W_i t_i}{N \left( t_s + \frac{1}{N} \sum_{i \neq s} W_i t_i \right)}$$


Thus,


$$\eta = 1/NT = 1/[\rho_3 + N (1 - \rho_3)]$$


Consequently, using Chen's parallelism measure,
we may easily approximate the normalized job process-
ing time required by a machine having N levels of
parallelism. We now have an analytical means of map-
ping a job stream containing a range of parallelism
into a distribution of normalized processing times.
In the next section we will discuss how this will help
us determine the optimal degree of hardware parallelism.


Section IV


In this section we will illustrate how the user
cost-processing time tradeoff curve can be used to de-
termine the optimal degree of hardware parallelism. We
will keep design factors such as the technology used
fixed and observe the effect on job cost and job pro-
cessing time caused by varying the degree of hardware
parallelism. We will then be able to select the "best"
degree of parallelism for a particular job by observing
where these points lie with respect to the user's trade-
off curve.


The normalized cost of processing a job is,


$$\text{Cost} = HT$$


where H is the cost of the system per unit time (rental
cost), and T is the normalized processing time. In
parallel hardware systems, H is of course a function of
the amount of parallelism in the system (N). If a
system is "totally" parallel in the sense that it has N
of all its functional modules (and if the cost of sys-
tem software is negligible), we might expect the rent
associated with this hardware to increase linearly
with N. In that case,


$$\text{Cost} = RNT$$


where R is the cost per unit time of the basic, non-
parallel module.


Many parallel systems are, however, not totally
parallel. Parallel processing systems such as the
ILLIAC IV and STARAN consist of parallel execution
elements under the control of a single instruction de-
coder (1,11). Doubling the degree of parallelism in
such a system does not double its total cost. Also,
we can expect that system software costs will not in-
crease in proportion to the degree of hardware paral-
lelism. Consequently, a cost which increases linearly
with N is probably a "worst case" assumption. Perhaps
a more realistic assumption would be to use Grosch's
Law which states that the system cost will increase in
proportion to the square root of the power of the pro-
cessor. Since the amount of parallelism is a measure
of the power of the processor, we have,


$$\text{Cost} = R N^{1/2} T$$


We do not claim that either of these simple for-
mulas is valid for all cases. We will use them merely
to demonstrate the methodology which we are developing
in this section. Presumably, the designer will be able
to fairly accurately estimate the way in which system
cost will vary with N for the particular type of
parallelism he is considering. In employing this de-
sign methodology, he should of course use his estimate
rather than one of these simple formulas.


Figure 4 pertains to the formula


$$\text{Cost} = R N^{1/2} T$$


while Figure 5 illustrates the situation


$$\text{Cost} = RNT$$


In each of these figures, the degree of hardware paral-
lelism is varied from one to eight, and the degree of
job parallelism (as measured by Chen's parallelism de-
finition) is varied from 0.5 to 0.95.
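Points of the kind plotted in Figures 4 and 5 can be regenerated from the relations T = (1 - rho3) + rho3/N, Cost = R sqrt(N) T and Cost = R N T. The sketch below is illustrative: the function names are hypothetical, R is set to 1 so that costs are expressed in units of R, and the values N = 1, 2, 4, 8 are assumed to be the ones plotted.

```python
import math

def normalized_time(n, rho3):
    """Approximate normalized processing time T = (1 - rho3) + rho3 / N."""
    return (1.0 - rho3) + rho3 / n

def efficiency(n, rho3):
    """Chen's machine efficiency, eta = 1 / (N T)."""
    return 1.0 / (n * normalized_time(n, rho3))

def cost_linear(n, rho3, r=1.0):
    """Cost = R N T, the 'worst case' assumption (Figure 5)."""
    return r * n * normalized_time(n, rho3)

def cost_grosch(n, rho3, r=1.0):
    """Cost = R sqrt(N) T, the Grosch's Law assumption (Figure 4)."""
    return r * math.sqrt(n) * normalized_time(n, rho3)

for rho3 in (0.50, 0.80, 0.85, 0.90, 0.95):
    for n in (1, 2, 4, 8):
        print(f"rho3={rho3:.2f} N={n}  T={normalized_time(n, rho3):.3f}  "
              f"Grosch cost={cost_grosch(n, rho3):.3f}R  "
              f"linear cost={cost_linear(n, rho3):.3f}R  "
              f"eta={efficiency(n, rho3):.2f}")
```

Under the linear assumption the cost never falls below R as N grows, whereas under the square-root assumption highly parallel jobs become cheaper as well as faster, which is why the two figures lead to different design conclusions.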


Once a designer has obtained a graph of this sort
based on his estimates of system cost and job stream
parallelism, he can superimpose his user's cost versus
processing time tradeoff curve. Having done this, he
can immediately identify which hardware-software
parallelism combinations will correspond to viable pro-
ducts. The selection of the "best" of these viable
combinations may or may not be trivial.


Figure 4. Cost (in units of R, scale up to 3R) versus Normalized Processing Time (T), for N = 1 to 8 and $\rho_3$ = .50, .80, .85, .90, .95.


Figure 5. Cost (in units of R, scale up to 8R) versus Normalized Processing Time (T), for N = 1 to 8 and $\rho_3$ = .50, .80, .85, .90, .95, with a possible tradeoff curve superimposed.


If we visualize cost-time tradeoff curves of the 
type illustrated in Figure 1 superimposed on Figures 4 
and 5, we can make the following conclusions. 


(1) If the non-parallel hardware design produces 
an "operating point" in Region I of Figure 1, hardware 
parallelism will be justified only if the software is 
highly parallel and if cost of the system does not in- 
crease linearly with the degree of parallelism. 


(2) If the non-parallel hardware design produces 
an operating point in Region II of Figure 1, some de- 
gree of parallelism may be justified even if the soft- 
ware is not highly parallel or even if the system cost 
increases linearly with the degree of hardware paral-
lelism (e.g. for the tradeoff curve in Figure 5,
N = 2 or 4 would be a good design if $\rho_3 \geq .80$).


(3) If the non-parallel hardware design produces
an operating point in Region III, some degree of hard- 
ware parallelism is mandatory. 


(4) As the degree of hardware parallelism is in- 
creased, the "spread" of the operating points for a 
job stream of differing parallelism also increases.
Thus, if one has a job stream encompassing a substan- 
tial spread of software parallelism, it might be de- 
sirable to divide it into subsets having small paral- 
lelism variation and then determine the best degree of 
hardware parallelism for each subset. 


As a final point, we wish to make some observations 
relative to the issue of hardware efficiency. In Sec- 
tion III, we derived the relationship 


$$\eta = 1/[N (1 - \rho_3) + \rho_3]$$


As Chen points out, this efficiency measure drops
rapidly with increasing N, even if $\rho_3$ is high. For
instance, if N = 8 and $\rho_3$ = .8, $\eta$ = .42. One would
think that a system that was only 42% efficient would
be a poor design and that this combination of N = 8
and $\rho_3$ = .8 could be rejected on that basis. However,
Figure 4 illustrates that, using the design methodol-
ogy outlined in this paper, this inefficient design
might be the best system from the user's point of view.
Thus, we feel that even though Chen's definition of
efficiency is reasonable, one should not use it as a
performance measure in determining if a design is via-
ble. (Of course, we have restricted our investigation
to monoprogramming systems. Therefore we do not claim
that this comment is necessarily applicable to multi-
programming systems for which high efficiency is a
dominant design goal.)
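As a check on the figures quoted for N = 8 and a job parallelism of .8, the efficiency and the normalized processing time can be worked out side by side (here $\rho_3$ denotes Chen's parallelism measure and $\eta$ the machine efficiency):

```latex
\eta = \frac{1}{N(1-\rho_3)+\rho_3}
     = \frac{1}{8(0.2)+0.8}
     = \frac{1}{2.4} \approx 0.42,
\qquad
T = (1-\rho_3) + \frac{\rho_3}{N} = 0.2 + 0.1 = 0.3 .
```

The same design that is only 42% efficient therefore completes the job in about 30% of the non-parallel processing time, which is the property a user in Region II is paying for.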


SUMMARY 


The consideration of the user's cost-performance 
tradeoff curve has enabled us to present a unified 
approach to the derivation of important architectural 
design guidelines. We have derived relationships be- 
tween "software parallelism" and the performance of 
systems with different degrees of hardware parallelism. 
Using these relationships, we have shown that situa- 
tions may arise where an increase in hardware paral- 
lelism is desirable even though it causes an increase 
in the job processing cost. Also, we have shown that, 
for a non-multiprogrammed system, the optimal system 
may exhibit a rather low hardware efficiency.


ABSTRACT


An array of very simple processing elements is des-
cribed, each with a local semiconductor store. The
array may also be used as main storage.


Bit-organisation gives great flexibility, including the 
minimisation of word length. Use of MSI and LSI is 
helped by the simplicity of the serial design. Using 
15-bit fixed point, the theoretical performance of a
72 x 128 array is about $10^8$ multiplications or $10^9$
additions per second. Comparisons are made with other 
architectures. 


Meteorology is considered as an application. It is 
attractive to have the whole problem in the array 
storage. 


1. INTRODUCTION 


This paper describes a design study of an array of 
elements that can be used either as a "Single- 
Instruction, Multiple-Data stream" (SIMD) processor or 
as a store. Architectural features of interest are: 
(a) the use of serial arithmetic to simplify processor 
logic and optimise store utilisation; (b) an attempt 
to avoid I/O bottlenecks by mapping complete problems
into the array, without relying on overlay techniques; 
(c) provision for using all or part of the array as a 
store when not performing its specialised processing 
functions; (d) the close integration of storage and 
logic. 


The main attractions of array-type SIMD structures are: 
(a) high absolute performance on certain problems of 
importance; (b) high performance/cost, partly result- 
ing from using common control logic. 


Several examples of this type of architecture have been 
proposed (1-8) and applications have been suggested in, 
for example, meteorology, plasma physics and linear 
programming. Most structures have a single control 
unit that broadcasts instructions to a regular array of 
processing elements (PEs) each with individual storage 
and an arithmetic unit (AU). 


Flynn (2) points out four factors that degrade the 
performance from the theoretical figure given by 
"Number of PEs times PE performance": (a) Each PE 
has direct access only to a limited region of store, 
and excess time may be taken accessing other regions; 
(b) Mapping the problem onto the array may leave some 
PEs unused; (c) Owing to overheads in preparing in- 
structions for the array, there may be times when the 
whole array is idle; (d) While dealing with singular- 
ities or boundary conditions the majority of PEs are 
idle. 


These factors are acknowledged to reduce the applica-
bility of such an array. In the present design attempts
have been made to mitigate their effect, but the
over-riding consideration has been to simplify the PE
design; this has been done to the extent that the
theoretical performance is very high, in spite of the
AU cost being small compared with that of the storage.
In effect, therefore, the store is being adapted to an 
array processing function. This may be contrasted 
with attempts to adapt the processor to array operations 
(e.g. CDC STAR). 


A dispersed system, i.e. one with many PEs each with 
local memory, has potential cost and speed advantages 
deriving from: (a) reduced "cable" delays; (b) re- 
duced address transforming and checking; (c) faster 
actual access; (d) simplified data routing and priority 
logic. 


A number of potential PE designs of varying parallelism 
have been considered for building arrays of the same 
theoretical performance, with the following general 
results. 


The gate count varies with the degree of internal PE 
parallelism. A purely serial PE has considerable 
advantages particularly for low precision work. 


Serial PEs have fewer connections at all packaging 
levels. 


The extreme simplicity of serial PEs permits the very 
effective use of batch fabrication and testing techniques 
and keeps hardware development rapid and cheap. The 
small number of circuit and board types helps develop- 
ment, production, spares holding and maintenance. 


Serial designs have exceptional functional flexibility; 
very few decisions are built into the hardware. However, 
fully indexed addressing is expensive. 


The design is somewhat similar to SOLOMON 1 (8); the 
main differences stem from the exploitation of modern
technology. 


2. THE ARRAY


2.1 CONFIGURATION 


FIGURE 1.  M.C.U. DIAGRAM


Figure 1 is an overall configuration diagram. The
rectangular array has an essentially two dimensional 
nearest neighbour connectivity, and has one dimension 
matched to the store highway of a conventional computer 
(the "parent" machine). This connection provides the 
route for loading both data and array instructions into 
the array storage for array processing; it also permits 
the parent machine to use the array storage as its own 
main storage. Input/output is done by the parent 
machine. 


The Main Control Unit (MCU) has: (a) a conventional 
instruction fetching arrangement; (b) an instruction 
buffer whose purpose will be described later; and (c) 
a set of registers, many of which can be matched to the 
array by row or column for a variety of purposes, one
of which is indexing. For sizable arrays the MCU is
a very small fraction of the total hardware.


After loading, the bits of a word are spread along a 
column of PEs, and this method of holding data is termed 
Main Store mode. Another method, termed Array mode, 
stores all the bits of a word in a single PE. This is
more attractive for processing large arrays, but requires
initial and final transformation of the data from and
to Main Store mode; this is done inside the array.


2.2 THE PE 


FIGURE 2.


Figure 2 is a PE diagram. The registers are all one-
bit; P and Q are for operands, C is the carry register,
A1 and A2 are activity bits that can prevent writing to
store, and B1 and B2 can supply 2 address bits. The
routing multiplexor can select a bit from the PE's own
store, or from a neighbour's store, for writing to a
register; selecting zero and controlling its inversion
permits data input from outside the array (for example,
an MCU register). The sum, carry, data input or con-
tents of Q can be output from the logic, usually to the
store. The store contents can be output externally
(to, for example, an MCU register) via the gates at the
bottom of Figure 2; the bits output can be either from
a selected column of PEs, or the logical AND of rows
(or columns) of PEs. One use for the latter is for a
test over all PEs.


The fifth "neighbour" connection is to the PE half a
row away in the same row; this permits both faster
mass movement of data around the array, and a 2½D
PE geometry. Bit patterns in one or two MCU registers
can be applied to the "inversion" inputs to produce a 
veto selective by rows and/or columns on writing to PE 
stores. Figure 2 shows 4 address bits capable of 
being selected by row or column; what indexing 
facilities should be provided is still an area of 
debate. 


Some differences from the PE in (7) are: (a) more
row/column symmetry; (b) a latch feature (shown on
the P register) for associative comparisons; (c) data
can be shifted directly between PEs without using the
store; (d) input data can be loaded directly into
store; (e) there is a ripple carry path between
PEs for Main Store mode arithmetic; (f) the bipolar
store is now 4K instead of 2K.


It is intended to package 2 PEs minus their stores and 
routing multiplexors in one 24 pin integrated circuit. 


2.3 EDGE CONNECTIONS

For instructions that involve neighbours, it is the 
array geometry that determines what happens at the 
array edges. Rows or columns may be: (a) cyclic, 
with their ends connected together; (b) linear, with 
a continuation onto a neighbouring line; (c) as (b) 
but with the extreme ends connected; or (d) plane, 
with external data applied at the relevant edge. In 
addition, a row may be considered in two halves (2½D
geometry). There are thus 32 geometries, and they are 
set by program. 
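One reading that reproduces the count of 32 is four edge options for rows, four for columns, and an optional split of each row into halves. That decomposition is an assumption about how the total arises, and the mode names in the sketch below are hypothetical labels for options (a) to (d).

```python
from itertools import product

# Hypothetical labels for edge options (a)-(d) in the text.
EDGE_MODES = ("cyclic", "linear", "linear_ends_joined", "plane")

# Assumed decomposition: row mode x column mode x (row split in halves or not).
geometries = [
    {"rows": row_mode, "columns": col_mode, "row_in_two_halves": halves}
    for row_mode, col_mode, halves in product(EDGE_MODES, EDGE_MODES, (False, True))
]

print(len(geometries))  # 4 * 4 * 2 = 32
```

A count of 32 means the geometry can be selected with five bits of program-set state, which fits the statement that geometries are set by program.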


2.4 CONSTRUCTION 

A board would contain a 6 x 4 PE section with 4K bits/ 
PE; there would be 147 external connections and 173 
ICs, 96 of them for storage. The array can be viewed 
as doing processing in the store, and costs only about 
25% more than ordinary storage made out of the same 
technology. A platter would contain a 36 x 16 PE 
section; the number 36, and multiples of it, match 
standard store highways. "Folding" of the array 
makes connections between the extreme edges short. 


The economy obtained by the dense packing of the 
integrated circuits is the result of the favourable 
marriage of space-limited (or power-dissipation
limited) storage and pin-limited logic. 


2.5 TIMING 


Because most micro-instructions do not involve a
response from the array, the equalisation, rather than
minimisation, of delays is important. Even with a
comparatively slow logic technology, the micro-
instruction rate should be about 5½ MHz; the storage
element delays are the biggest factor, and this illus-
trates how the array can exploit bipolar store speeds,
unlike a large conventional machine.


2.6 FUNCTIONS 

In (7) the basis of the micro-programming notation is 
given and it is shown how Array mode fixed and floating 
point instructions are built up. Bit organisation
means that only necessary work need be done; for 
example, multiplication only needs to calculate a 
single length result. 


Code for execution must be compiled down to the one-bit 
micro-instructions, except that for working regularly 
along the bits of words a short loop can be constructed. 
This loop is held in the instruction buffer, so that no 
further instruction fetching from the array storage is 
needed during execution of the loop. This feature 
reduces the instruction fetching overhead from 100% to 
about 20%. Subroutine construction will be possible. 


2.7 PERFORMANCE

For Array mode, fractional fixed point multiplication
takes about

$$n(4n + 1)/2$$

micro-instructions where n is the word length; fixed
point addition takes little more than 4n micro-
instructions. Floating point takes a little longer
for multiplication, and considerably longer for
addition (see (7)). 20-bit multiplication takes
about 730 micro-instructions plus about 160 cycles for
micro-instruction fetching, and at 5½ MHz would take
about 160 μsec; 20-bit addition takes about 12 μsec.
Multiplication of an array by a common number can be
about four times faster.
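The 20-bit multiplication time can be checked from the cycle counts, taking the clock period of 180 nsec quoted in Section 2.8 (equivalent to a rate of about 5½ MHz):

```latex
(730 + 160)\ \text{cycles} \times 180\ \text{nsec}
  = 890 \times 0.18\ \mu\text{sec}
  \approx 160\ \mu\text{sec},
\qquad
\frac{160}{730} \approx 0.22 .
```

The second ratio is the instruction fetching overhead for this operation, in line with the figure of about 20% given for loops held in the instruction buffer.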


Main store mode arithmetic is faster than Array mode 
for smaller arrays. In terms of absolute speed, 
addition is about 11 times faster and multiplication, 
using a carry save technique ending with a ripple carry, 
is about six times faster for 20 bit precision (the 
latter factor increases with the precision). 


FIGURE 3. D.A.P. PERFORMANCE (MIPS versus number of parallel data streams, for multiplication of 20-bit words on a 72 x 128 PE array; curves for Array mode, Main Store mode and a vector machine)


The user has three modes of working at his disposal: 
the parent machine for scalar working, Main Store mode 
for small arrays and Array mode for large arrays. 
Figure 3 shows roughly what is possible in the three
modes; the useful processing rate in Million Instruc-
tions (or, more accurately, results) Per Second (MIPS)
is plotted against the number of parallel data streams
for the type of computing indicated and a 9200 PE array. 
Only the top ends of the sloping lines depend on array 
size. The dashed line shows the similar graph for a 
powerful vector machine (there are many other differ- 
ences between the two types of machine). 


The overall performance depends on the application and 
programmer skill. 


2.8 A COMPARISON

ILLIAC IV is a well known machine, so a brief com- 
parison is attempted with Array mode, assuming the 
problem parallelism is sufficient to occupy either 
machine. Many differences are not easily quantifiable, 
but as a starting point the main assumptions for a 
numerical comparison are given in Figure 4. The
first four lines give the instruction mix; B is the
number of bits precision for the serial design, which
has no separate store accesses because all functions are
store-to-store. P is the clock period (180 nsec).
20% is subtracted from the ILLIAC IV totals to allow
for instruction overlap.


Figure 5 compares the hardware required to build an
array of given performance for words of a particular 
precision. Logic and storage have equal weight; 
Figure 4 gives the gates/PE ratio and the storage 
comparison involves an estimate of the unnecessary 
bits in the ILLIAC IV word. The graph would favour 
ILLIAC IV only for working exclusively with 46-49 bit 
precision. At low precisions serial PEs have a very 
big advantage. 


Such numerical comparisons are of only limited value. 
For example, the vertical scale of Figure 5 would be 
multiplied by about 4 if integrated circuit count were 
used as a hardware measure. Other factors such as 
hardware simplicity and repetition, pin counts and 
functional flexibility are equally important. 


2.9 EXAMPLE OF STORAGE ECONOMY


For problems with large amounts of data, storage 
economy is important, particularly if it permits 
storing the complete problem in the array. The user 
can apply various tricks. As an example, consider 
three dimensional field problems. In order to prevent 
physical "truncation" errors, programs are designed so
that differences between neighbouring variables require
fewer significant bits than the variables themselves. 
If variables have to be held simultaneously for two 
time steps, then, for example, they can be grouped into 
sets of 16 nearest neighbours in space and time (2 x 2 
x 2 x 2), and held as follows: (a) a short floating 
point number close to the maximum of the group (maybe 
a 4-bit mantissa and 3-bit exponent); and (b) 16 
differences in block floating point (maybe 12-bit 
mantissas and a common 2-bit block exponent). This
results in 12.6 bits/variable and is roughly equivalent
to floating point with a 15-bit mantissa and 3-bit
exponent, i.e. a gain of nearly 50%; other machines 
require floating point variables to occupy up to 64 
bits, i.e. up to 5 times more. 
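The figure of 12.6 bits/variable follows from the group layout described above. The sketch below simply redoes that arithmetic; the constant names are illustrative, and the field widths are the "maybe" values quoted in the text rather than fixed design parameters.

```python
GROUP_SIZE = 2 * 2 * 2 * 2   # 16 nearest neighbours in space and time
LEAD_BITS = 4 + 3            # short floating point: 4-bit mantissa, 3-bit exponent
DIFF_MANTISSA_BITS = 12      # mantissa of each of the 16 differences
BLOCK_EXPONENT_BITS = 2      # one exponent shared by the whole block

group_bits = LEAD_BITS + GROUP_SIZE * DIFF_MANTISSA_BITS + BLOCK_EXPONENT_BITS
bits_per_variable = group_bits / GROUP_SIZE

plain_float_bits = 15 + 3    # equivalent floating point: 15-bit mantissa, 3-bit exponent
gain = plain_float_bits / bits_per_variable
ratio_to_64_bit = 64 / bits_per_variable

print(f"{group_bits} bits per group, {bits_per_variable:.1f} bits/variable")
print(f"gain over 18-bit floating point: {gain:.2f}x")
print(f"64-bit variables need {ratio_to_64_bit:.1f}x more storage")
```

The shared exponent and the short leading number add only 9 bits to each group of 16, so almost all of the storage goes to the 12-bit differences.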


3. METEOROLOGY AS AN APPLICATION


This is considered more fully in (7). Meteorology 
includes both simulation experiments and forecasting, 
and as simulation programs are central to both, atten- 
tion will be confined to them. (Forecasting also 
uses analysis and initialisation programs to assimilate 
the "real" data). For simulation programs, the fre- 
quency of add/subtract and multiply instructions is 
roughly equal, and divide is much less frequent. For 
DAP, multiplication takes much longer than addition, 
so the number of multiplications and their timing give 
a first approximation to the speed of a program. 


The table gives a rough guide to parameters in use today
and those that should be aimed at.


Using the 18 bit (fixed point) precision suggested in
section 3.4, each PE can perform a multiplication in
about 140 μsec. Section 3.2 discusses the efficiency
of PE usage; 50% might be a reasonable figure. Thus
about 8000 PEs are adequate to perform the 2.5 x $10^7$
multiplications per second indicated above.
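The PE count can be checked from the quoted multiplication time and efficiency. This is a minimal sketch of that arithmetic; the variable names are illustrative and the inputs are the figures given in the text.

```python
MULT_TIME_S = 140e-6     # one 18-bit multiplication per PE, about 140 microseconds
EFFICIENCY = 0.5         # assumed fraction of time a PE is usefully active
REQUIRED_RATE = 2.5e7    # multiplications per second needed

rate_per_pe = EFFICIENCY / MULT_TIME_S    # useful multiplications per second per PE
pes_needed = REQUIRED_RATE / rate_per_pe  # minimum number of PEs

print(f"{rate_per_pe:.0f} multiplications/sec per PE")
print(f"{pes_needed:.0f} PEs needed")
print(f"8000 PEs deliver {8000 * rate_per_pe:.2e} multiplications/sec")
```

About 7000 PEs meet the requirement exactly, so an array of about 8000 leaves a modest margin.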


                                        Present                  Next stage
                                 Forecast    Global
                                 Programs    Research
                                             Programs

Columns of grid points                                           x 4
Number of vertical levels                                        x 2
Total number of variables                                        x 8 (1.6 x $10^6$)
Time step
Number of time steps                                             x 3
Multiplications per column
  per time step                                                  x 2.5
Multiplications per sec.                                         x 20 (2.5 x $10^7$)
Speed-up over real time                                          50-100


It may be tempting to use a backing store for big
problems; however, the smaller the array storage the 
larger is the channel capacity required. In (7) an 
example was studied of a problem using explicit 
integration which had 1.5 x $10^6$ variables of average
length 20 bits, and was processed on an 8200 PE array
with an I/O channel of $10^7$ bits/sec. Three formula-
tions of the problem had the following trade-offs:
(a) 1850 bits/PE and speed degraded by a factor of 
2.5, (b) 2800 bits/PE and speed degraded by 1.3, and 
(c) 4600 bits/PE, the complete problem in the array 
and no degradation. A similar problem using implicit 
methods would have its speed degraded by an order of 
magnitude if a backing store was used. 


This sort of problem needs about 5-10 x $10^7$ bits of
storage. The falling cost of semi-conductor storage 
makes this amount of array storage feasible, and the 
simplicity and reliability of a unified semi-conductor 
system makes it attractive. Partly for these reasons, 
the array has more resources devoted to storage than 
to logic. 


3.2 PARALLELISM


Efficiency, defined as the fraction of time a PE is
active, depends on programmer skill as well as the
problem. Numerical procedures used at present have
usually been devised with serial machines in mind,
and sometimes a slightly different procedure may be
much more efficient.


Explicit methods for the "basic" meteorological 
equations are efficient. Boundaries do not have much 
effect because it is usually a case of omitting things. 
"Secondary" effects may cause efficiency to drop. The 
computation is different if the air is saturated. 


Convection may require the checking of neighbouring 
vertical layers for stability, followed by a relaxation 
process. Study indicates that these effects need not 
have a major effect on the overall efficiency. 


Once various conditions have been established, "branch-
ing" by means of activity bits is very rapid, and can
be done frequently in order to improve parallelism.
(A conditional branch in a conventional program loop,
or selection in a vector machine, is slow by comparison.)


Implicit methods involve either ADI (alternating direc- 
tion implicit) or relaxation methods; the former are 
not particularly efficient but the latter are. 


There seem to be 4 types of grid in use: (a) rect- 
angular for fairly local forecasts; (b) octagonal in 
overall shape (rectangular neighbour connection) for 
the northern hemisphere; (c) cylindrical on a global 
latitude-longitude basis; (d) as (c) except that the 
number of points on a line of latitude is reduced as 
the poles are approached. (a) and (c) can fit a rect-
angular PE array. (b) and (d) would waste some of the
PEs. (c) has reduced efficiency because a smoothing
process is applied more times near the poles; this can
be viewed as a trade-off for the wasted PEs of (d).


3.4 PRECISION AND NUMBER REPRESENTATION

Precision costs time and storage space, so that big 
problems should use only the minimum consistent with 
accumulated round-off error being small compared with 
other errors. Different variables can use different 
number representations and precisions. Knowledge of 
requirements is only patchy, but should improve; the 
pay-off, compared with fairly cautious starting schemes, 
might be a factor of about 1.5 in storage and 2 in 
speed. 


Meteorology is largely concerned with absolute rather 
than relative accuracy, and the maximum possible values 
of variables are well understood; this points to either 
fractional fixed point or a simple floating point. 
Block-floating of arrays (9) can also be implemented 
efficiently. 


An example of possible economy in space and speed 
occurs in explicit integration schemes; the increments 
to variables require considerably less precision than 
the full variables. 


Careful choice of rounding method in order to avoid bias 
can also lead to economy (7). 


A reasonable estimate of the average precision required 
for fractional fixed point variables might be 18 bits 
and rather less for the mantissa of floating point 
variables. 


4. OTHER APPLICATIONS 


An algorithm to solve the two dimensional Poisson's 
equation was studied. It used a Fast Fourier Trans- 
form technique, but the extensive data shuffling that 
this involved occupied only 20-25% of the time. There 
was also reduced parallelism in places, and a typical 
PE was idle about 50% of the time. On a 72 x 64 PE
array, a 256 x 256 mesh was estimated to take 50 msec 
for 20-bit numbers; this compares very favourably with 
conventional machines. An interesting aspect is that 
the main array is held in Array mode and certain row and 
column features are dealt with in Main Store mode; Main 


in single arithmetic operations.

For the array to be useful, problems must fulfil
three conditions: (a) Processing, as opposed to I/O,
must be important; (b) Much of the problem must be 
programmed with parallel and identical operations 
(these may, however, be selective); (c) Excessive 
time should not be spent shuffling data round the 
array. (In some cases this means the data should 
be fairly regular). 


These requirements are not very severe, and the biggest 
barrier to widespread use is likely to be in devising 
an acceptable programming language. (In spite of 
many problems being naturally parallel, many users 

are indoctrinated by sequential thinking). 


Some applications for array processors are discussed 
in (5). Further applications are suggested by the 
fact that the array can be used as an "associative 
processor"; examples might be air traffic control, 
graphics processing and symbol processing. Associative 
information retrieval can look attractive over quite a 
wide range of parameters; with the associative latch, 
each PE can scan 1 bit per micro-instruction, and so
10 000 PEs can scan 5 x 10^10 bits/second.


The user has the freedom to optimise and experiment 
from the bit level upwards; this may help him under- 
stand his real computing requirements. The array is 
not arithmetic biased, and the functional flexibility 
permits functions to be tailored for all sorts of 
purposes. The hardware simplicity permits parameters 
such as the number of bits/PE and the type of storage 
to be varied easily; for example, a slower, cheaper 
MOS version would extend the range of applications 
considerably. The array modularity (almost like
storage modularity) means that sizes from 500 to
40 000 PEs are reasonable.


MAXIMAL RATE PIPELINED SOLUTIONS
TO RECURRENCE PROBLEMS

ABSTRACT


An $m$th-order recurrence problem is defined as the
computation of $X_1, \ldots, X_N$, where
$X_i = f(a_i, X_{i-1}, \ldots, X_{i-m})$ and $a_i$ is a set of
parameters. On a pipelined computer, where the total stage
delay in computing $f$ is $d_f$ time units, the solution output
rate is one new $X_i$ every $d_f$ time units. This paper
describes a method for increasing this rate to 1 per time
unit when the function $f$ has certain simple functional
properties. The total stage delay and complexity of the
resulting pipelines are also described.


I. INTRODUCTION


An $m$th-order recurrence problem is defined as the
computation of the sequence $X_1, \ldots, X_N$ given only


1. initial conditions $X_0, X_{-1}, \ldots, X_{1-m}$,


2. "parameter vectors" $a_1, \ldots, a_N$, where each
$a_i$ is a collection of solution-independent parameters,


3. a "recurrence function" $f$,


such that for each $i$, $1 \le i \le N$,


$$X_i = f(a_i, X_{i-1}, \ldots, X_{i-m}) \qquad (1)$$


An example is the $m$th-order linear recurrence


$$X_i = \sum_{r=1}^{m} a_i(r)\, X_{i-r} + a_i(m+1) \qquad (2)$$


A pipelined computing device is one that accepts inputs
at a rate of one every $r$ units of time and produces
corresponding outputs $p$ time units later, $p > r$. Up to
$\lceil p/r \rceil$* separate computations can be active within
the pipeline at one time. For this paper, $r = 1$, and thus a
pipeline may be considered a series of $p$ independent
"stages," each capable of holding a partial computation on a
distinct set of inputs.


Assuming that the function $f$ is computable by a pipelined
device with $d_f$ stages, a direct solution of a recurrence
problem is pictured in Illustration 1. Assuming that $X_{i-1}$
is output at time $j$, $X_i$, which depends on $X_{i-1}$, cannot
be output until $X_{i-1}$ has cycled through the entire $d_f$
stages of the pipeline; i.e., until time $j + d_f$. Illustration 2
diagrams the timing of the pipeline. Thus the output rate is at
most one element of the sequence per $d_f$ time units.


The purpose of this paper is to investigate the conditions
under which pipelined networks can be configured to have data
rates higher than $1/d_f$, up to 1 sequence element per time
unit. Section II describes a simple example of this procedure.
Section III details some conditions under which the
performance of pipelined solutions to first-order recurrence
problems can be increased. Section IV generalizes this to
$m$th-order recurrences. In all sections, both total pipeline
stage length and pipeline complexity are discussed.


* $\lceil x \rceil$ is the smallest integer not smaller than $x$.


The basic background for this paper originates in a
series of earlier papers on the solution of recurrence
problems on parallel computers (1, 2, 3).


ILLUSTRATION 1

Direct Implementation of Recurrence
[Diagram not reproduced; labels: Parameter Vectors, Buffers]


ILLUSTRATION 2

Timing
[Diagram not reproduced; labels: Input, Stages]


II. AN EXAMPLE


One of the simplest nontrivial recurrence problems involves
a recurrence equation of the form $X_i = a_i X_{i-1}$, where
$a_i$ is a real number expressed in floating point notation.
The function $f$ in this case is multiplication, a typical
implementation of which might involve a two-stage pipe, one
stage for exponent addition and one stage for mantissa
multiplication. With such an implementation, a direct solution
like Illustration 1 would have an output rate of 1/2, that is,
one new $X_i$ every other time unit. The pipelined nature of
the multiplier is not exploited.


However, the basic recurrence can be rewritten as


$$X_i = a_i\, a_{i-1} \cdots a_{i-q+1}\, X_{i-q} \qquad (3)$$


for any value of $q$. Each $X_i$ in this case requires $q+1$
numbers to be multiplied. Using the well known "log reduction"
technique, however, multiple multipliers can be arranged in a
tree-like arrangement that computes equation 3, and requires at
most $d(q) = \lceil \log_2 (q+1) \rceil$ multiplier delays (compare
Illustrations 3 and 4). If $X_{i-q}$, for example, is available
from such a network at time $j$, then $X_i$ can be computed by
time $j + d(q)$. This places an upper bound on the output of
$q$ different $X$'s ($X_{i-q+1}, \ldots, X_i$) in time $d(q)$ as
an output rate of $q/d(q)$. This rate is maximized to 1, a
distinct $X_i$ in each time unit, if $q \ge d(q)$. For our
example, with its two-stage multiplier, this relation is


$$q \ge 2 \lceil \log_2 (q+1) \rceil \qquad (4)$$


which occurs for $q \ge 6$. Illustration 3 diagrams a
log-product pipeline solution of equation 3 for $q = 3$; the
output rate is 3/4, a factor of 1.5 better than the direct
implementation but still not maximal. Illustration 4 diagrams a
maximal flow pipeline for $q = 6$.
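
The arithmetic behind equation 4 can be checked with a short Python sketch. It assumes, as in the example, a two-stage multiplier and a log-reduction tree of depth $\lceil \log_2 (q+1) \rceil$. The function names are illustrative only and do not come from the paper.

```python
from fractions import Fraction

MULT_STAGES = 2  # assumed: one stage for exponent addition, one for mantissa multiply


def tree_levels(q):
    """Smallest L with 2**L >= q + 1, i.e. ceil(log2(q + 1)), using integers only."""
    levels = 0
    while (1 << levels) < q + 1:
        levels += 1
    return levels


def total_delay(q):
    """Delay in time units of a log-product tree that multiplies q + 1 numbers."""
    return MULT_STAGES * tree_levels(q)


def output_rate(q):
    """q new X's every total_delay(q) time units, capped at one per time unit."""
    return min(Fraction(1), Fraction(q, total_delay(q)))


def minimal_q():
    """Smallest q with q >= total_delay(q), the maximal-rate condition of equation 4."""
    q = 1
    while q < total_delay(q):
        q += 1
    return q
```

With these definitions, `output_rate(1)` corresponds to the direct implementation, `output_rate(3)` to the pipeline of Illustration 3, and `minimal_q()` to the smallest look-back distance that reaches the maximal rate.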


ILLUSTRATION 3

Log Product Pipeline for q = 3
[Diagram not reproduced; buffers are single unit delays]

Total Stage Delay = 4
Output Rate = 3/4
Speed Up = 3/2
Complexity = 3 Multipliers


Several comments should be made with respect to the
maximal rate pipeline of Illustration 4:


1. Buffering is used to equalize the delays in all sections
of the pipeline to 6 time units.


2. At each time unit, all buffers and all stages of all
multipliers are computing products that will lead to
some element of the solution sequence; i.e., the
pipeline is fully loaded.


3. The multipliers labeled M1,2 and M1,3 are redundant
in that any calculations they perform were performed
earlier by M1,1.


The redundant M1,2 and M1,3 can be removed by moving
the buffers B2 - B5 from the inputs to M1,2 and M1,3, and
placing them on the output of M1,1, as shown in Illustration
5. This pipeline still exhibits maximal flow, but involves no
redundant computations.


ILLUSTRATION 4

Maximal Rate Log Product Pipeline
[Diagram not reproduced; complexity: 6 multipliers]


ILLUSTRATION 5

Maximal Rate Log Product Pipeline Without Redundancy


III. FIRST-ORDER RECURRENCE

The key behind the applicability of the log-reduction
technique to the example of the last section was the
associativity of the recurrence function, multiplication.
Although many recurrence functions are associative, and are
solvable in a manner identical to that used above, most of the
more common recurrences are not. As detailed in earlier papers,
however, a large class of problems, particularly first-order
problems, have a property similar to associativity that is as
useful in configuring maximal rate pipelines (1, 2, 3). This
property is termed "semi-associativity," and is defined as
follows:


DEFINITION 1


A recurrence function, $f$, is said to be semi-associative
with respect to a companion function, $g$, if there exists a
function $g$ such that for all parameter vectors $a$ and $b$
and all $x$'s:


$$f(a, f(b, x)) = f(g(a, b), x) \qquad (5)$$


An easily proved corollary to this definition is that, with
respect to its effects on $f$, the companion function, $g$,
is associative.


COROLLARY 1


For all parameter vectors $a$, $b$, and $c$, and all $x$:


$$f(g(a, g(b, c)), x) = f(g(g(a, b), c), x) \qquad (6)$$


Examples of recurrences that have a companion function
are:
(7) [equation not recoverable from the source]


(8) [equation not recoverable from the source]


$$X_i = \frac{a_i(1)\, X_{i-1} + a_i(2)}{a_i(3)\, X_{i-1} + a_i(4)} \qquad (9)$$
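
As a concrete illustration of Definition 1, the sketch below uses the first-order linear recurrence $X_i = a_i(1) X_{i-1} + a_i(2)$ (equation 2 with $m = 1$). The companion function `g` is the one equation 18 gives for $m = 1$. The helper names `direct` and `lookback` are hypothetical, chosen here only for illustration.

```python
import random


def f(a, x):
    """First-order linear recurrence function: X_i = a(1) * X_{i-1} + a(2)."""
    return a[0] * x + a[1]


def g(a, b):
    """Companion function: chosen so that f(a, f(b, x)) == f(g(a, b), x)."""
    return (a[0] * b[0], a[0] * b[1] + a[1])


def direct(params, x0):
    """Step-by-step solution; params[i - 1] holds a_i and the result[i] is X_i."""
    xs = [x0]
    for a in params:
        xs.append(f(a, xs[-1]))
    return xs


def lookback(params, xs, i, q):
    """X_i computed from X_{i-q} by folding g over a_i, a_{i-1}, ..., a_{i-q+1}."""
    acc = params[i - 1]                  # a_i
    for j in range(i - 1, i - q, -1):    # a_{i-1} down to a_{i-q+1}
        acc = g(acc, params[j - 1])
    return f(acc, xs[i - q])


def random_params(n, seed=0):
    """n integer parameter vectors (a(1), a(2)), reproducible for a given seed."""
    rng = random.Random(seed)
    return [(rng.randint(-5, 5), rng.randint(-5, 5)) for _ in range(n)]
```

The fold in `lookback` applies `g` sequentially. A log-reduction network would regroup the same applications into a tree, which Corollary 1 permits.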


In the following descriptions, it is assumed that pipelined
computing modules can be built for both $f$ and $g$, and that
the numbers of inherent stage delays are $d_f$ and $d_g$,
respectively.


The existence of a companion function allows a first-order
recurrence


$$X_i = f(a_i, X_{i-1}) \qquad (10)$$


to be placed in the following form for any $q$:


$$X_i = f(g(\cdots g(g(a_i, a_{i-1}), a_{i-2}) \cdots, a_{i-q+1}),\; X_{i-q}) \qquad (11)$$


The associativity of $g$ with respect to $f$ allows a
log-reduction network to compute the $g$ composition portion of
equation 11 in $\lceil \log_2 q \rceil$ $g$ computation delays.


The output of the final $g$ module drives the module that
computes $f$, as pictured in Illustration 6. Again buffers are
used to synchronize the arrival of data at each module. The
total delay through this pipeline is thus


$$d(q) = d_f + d_g \lceil \log_2 q \rceil \qquad (12)$$


Again a maximum rate pipeline requires that


$$d(q) = d_f + d_g \lceil \log_2 q \rceil \le q \qquad (13)$$


ILLUSTRATION 6

Pipelined Computation of $X_i = f(a_i, X_{i-1})$
[Diagram not reproduced; each buffer has a delay of 1]

As with Illustration 4, many of the $g$ modules in
Illustration 6 are redundant in that the computations they
perform are identical to computations performed several time
units earlier by some other module. Consequently, they can be
replaced by buffers that simply delay the output from the other
module by the appropriate amount. If $q$ is chosen to be the
minimum integer power of two that satisfies equation 13, then
this technique of substituting buffers for $g$ modules reduces a
network like Illustration 6, containing $q - 1$ $g$ modules, into
one like Illustration 7, which uses only $\lceil \log_2 q \rceil$
modules.


Table 1 summarizes, as a function of $d_f$ and $d_g$, the
minimum $q$ ($q_{min}$) that satisfies equation 13, the
corresponding $d(q)$, the minimum $q$ that is a power of two
($2^{\lceil \log_2 q_{min} \rceil}$), and the number of $g$
modules ($\lceil \log_2 q \rceil$).
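
The quantities that Table 1 is described as listing can be recomputed from equations 12 and 13. The following is a minimal sketch under that reading. The function name `table_row` and the dictionary keys are assumptions made here, and the values it produces are derived from the equations, not copied from the original table.

```python
def ceil_log2(n):
    """ceil(log2(n)) for an integer n >= 1, computed without floating point."""
    return (n - 1).bit_length()


def delay(q, d_f, d_g):
    """Equation 12: d(q) = d_f + d_g * ceil(log2 q)."""
    return d_f + d_g * ceil_log2(q)


def table_row(d_f, d_g):
    """One row of a Table-1-like summary for given stage delays d_f and d_g."""
    q_min = 1
    while delay(q_min, d_f, d_g) > q_min:    # equation 13 not yet satisfied
        q_min += 1
    levels = ceil_log2(q_min)
    return {
        "q_min": q_min,                      # minimum q satisfying equation 13
        "d": delay(q_min, d_f, d_g),         # corresponding total delay d(q)
        "q_pow2": 1 << levels,               # 2 ** ceil(log2 q_min)
        "g_modules": levels,                 # g modules left after buffer substitution
    }
```

Rounding $q_{min}$ up to a power of two leaves $\lceil \log_2 q \rceil$ unchanged, so the rounded value still satisfies equation 13.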


ILLUSTRATION 7

Minimal Complexity Pipeline for $X_i = f(a_i, X_{i-1})$
[Diagram not reproduced]


IV. mth-ORDER RECURRENCES


An $m$th-order recurrence, $m > 1$, has the form


$$X_i = f(a_i, X_{i-1}, \ldots, X_{i-m}) \qquad (14)$$


To speed up a pipeline computing this type of recurrence, we
want to express $X_i$ as


$$X_i = f(a_i^*, X_{i-q}, \ldots, X_{i-q-m+1}) \qquad (15)$$


where the time to compute $a_i^*$ from the original $a_i$'s
grows less rapidly than, and eventually is smaller than, $q$.
In the $m$th-order case, no simple associative or
semi-associative companion function is possible; the number of
arguments ($\ge 2$) is too large. However, as was shown in
earlier reports, many common recurrence functions have a related
pair of functions that do allow the construction of networks
with the desired characteristics (1, 2). These are defined as
follows:


DEFINITION 2


A recurrence function, $f$, is said to have a companion
set $(g, h)$ if there exist functions $g$ and $h$ such that for
all parameter vectors $a_1, \ldots, a_{m+1}$ and all $X$'s,
$X_1, \ldots, X_m$:


$$f(a_1, f(a_2, X_1, \ldots, X_m), X_1, \ldots, X_{m-1})
  = f(g(a_1, a_2), X_1, \ldots, X_m) \qquad (16)$$


$$f(a_1, f(a_2, X_1, \ldots, X_m), \ldots, f(a_{m+1}, X_1, \ldots, X_m))
  = f(h(a_1, \ldots, a_{m+1}), X_1, \ldots, X_m) \qquad (17)$$


TABLE 1

Complexity of Pipelines for First-Order Recurrences
[Table body not reproduced]


As an example, the $m$th-order linear recurrence (equation 2)
has the following companion set (where $g(a, b)(j)$ stands for
the $j$th component of the parameter vector $g(a, b)$):


$$g(a, b)(j) = \begin{cases}
a(1)\, b(j) + a(j+1) & 1 \le j \le m-1 \\
a(1)\, b(m) & j = m \\
a(m+1) + a(1)\, b(m+1) & j = m+1
\end{cases} \qquad (18)$$


$$h(a_1, \ldots, a_{m+1})(j) = \begin{cases}
\sum_{r=1}^{m} a_1(r)\, a_{r+1}(j) & 1 \le j \le m \\
\sum_{r=1}^{m} a_1(r)\, a_{r+1}(m+1) + a_1(m+1) & j = m+1
\end{cases} \qquad (19)$$

The utility of companion functions comes from the following
theorem (proved in reference 1):


THEOREM 2


For any $k \ge 0$, $X_i$ can be expressed in terms of
$X_{i-q(k)}$ as follows:


$$X_i = f(a_i^{(k)}, X_{i-q(k)}, \ldots, X_{i-q(k)-m+1}) \qquad (20)$$


where


$$q(k) = m\, 2^k + 1 - m \qquad (21)$$


and $a_i^{(k)}$ is computed from the following recurrence:


$$a_i^{(0)} = a_i \qquad (22)$$


$$a_i^{(k+1)} = h(a_i^{(k)}, A(i-q(k), m-1), A(i-q(k)-1, m-2), \ldots, A(i-q(k)-m+1, 0)) \qquad (23)$$


$$A(i, j) = g(g(\cdots g(a_i^{(k)}, a_{i-q(k)}), a_{i-q(k)-1}), \ldots, a_{i-q(k)-j+1}) \qquad (24)$$

Illustration 8 diagrams a typical network for computing
$a_i^{(k+1)}$ from $a_i^{(k)}$. Since the function $g$ is not
usually associative, or even semi-associative, no rearrangement
or reduction in the number of $g$ modules is generally possible.


The total delay in Illustration 8 is thus:


$$(m-1)\, d_g + d_h \ \text{time units} \qquad (25)$$


To build a pipeline that computes $a_i^{(K)}$ directly from the
$a_i$'s, the network of Illustration 8 must be cascaded into $K$
levels. As with the earlier pipelines, however, only one
copy of Illustration 8 is needed at each level. Additional
buffers are used to save redundant computations and
synchronize the arrival of the proper inputs. Illustration 9
diagrams such a pipeline.


For a $K$-level pipeline, like Illustration 9, the total delay
through the pipeline is simply $K$ times the delay of a single
network (equation 25) plus the delay to compute $f$:


$$d(K) = K((m-1)\, d_g + d_h) + d_f \qquad (26)$$


Again, for a maximal rate pipeline, this delay must not exceed
$q(K)$, equation 21; i.e., a $K$ must be found such that


$$K((m-1)\, d_g + d_h) + d_f \le m\, 2^K - m + 1 \qquad (27)$$
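
A small search makes the condition on $K$ concrete. This sketch assumes the relation is read as "total delay no greater than $q(K)$", consistent with the equality case of the Section II example. The function names and the `limit` safeguard are assumptions made here.

```python
def level_delay(m, d_g, d_h):
    """Delay of one level of the network: (m - 1) * d_g + d_h."""
    return (m - 1) * d_g + d_h


def lookback(m, k):
    """Look-back distance q(k) = m * 2**k + 1 - m."""
    return m * 2 ** k + 1 - m


def minimal_levels(m, d_f, d_g, d_h, limit=64):
    """Smallest positive K whose total pipeline delay does not exceed q(K)."""
    for K in range(1, limit + 1):
        if K * level_delay(m, d_g, d_h) + d_f <= lookback(m, K):
            return K
    raise ValueError("no K found up to the given limit")
```

Because the left side grows linearly in $K$ and the right side exponentially, a solution always exists for large enough $K$. The `limit` argument only guards against unreasonable inputs.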


ILLUSTRATION 8

One Level in the Computation of $a_i^{(k)}$


ILLUSTRATION 9

Pipelined Computation of $m$th-order Recurrence


Once this minimal value of $K$ has been determined, the
complexity of the required pipeline can be computed directly
from Illustrations 8 and 9, as shown in Table 2.

TABLE 2

Complexity of Pipelines for $m$th-order Recurrences

Type of Module    Number Required*
f
h
g

*K is the smallest positive integer that satisfies
equation 27.


V. CONCLUSIONS 


This paper has discussed methods of speeding up pipelined
computation of recurrence problems where feedback is present;
that is, where the computation of one element of the desired
sequence cannot be started before some earlier element has been
fully computed. The methods discussed basically involve
rewriting the recurrence so that $X_i$ depends on


$$X_{i-q(k)}, \ldots, X_{i-q(k)-m+1} \qquad (28)$$


and some computable parameter vector $a_i^{(k)}$. For many
recurrence problems, the time to compute $a_i^{(k)}$ grows much
less rapidly with $k$ than does $q(k)$. In such circumstances,
for large enough $k$, the total time to compute $X_i$ from
$X_{i-q(k)}, \ldots$ is less than $q(k)$, allowing the output of
the resulting pipeline to be fed directly back into the input
and yet still maintain a fully utilized pipeline that outputs a
new $X_i$ during each time unit. This pipeline is then running
at the maximum possible rate.


One question that has not been discussed in detail in this
paper is the problem of initializing the pipeline. The most
direct technique would be simply to precompute enough
$X_i$'s and $a_i^{(j)}$'s to fully initialize all stages in the
pipeline (perhaps using parts of the same pipeline at less than
maximal rate). Once this is done, the pipeline can be allowed to
run normally. This is a time-consuming process which, if
the pipeline is long enough, may negate many of the benefits
of the maximal rate pipeline once it is started. For some
specific problems, however, this process may be avoidable
by introducing special values for $a_i$'s and $X_i$'s. For example,
in Illustration 5, if the input to B6 is held to 1 for the first 

6 time units, and B1 initially loaded with 1, the pipeline will 
output X, at time 6, and run normally after that. 


The applicability of these speedup techniques depends in 
large measure on the particular problem being solved, the 
length of the desired solution sequence, and the stage delays 
in the basic f, g, and h computing modules. For problems 
where the modules have large stage delays, for example, the 
potential maximum speedup is significant, but the value of k 
required to attain that speedup may result in a very long and 
complex pipeline, where the time to initialize the pipeline 
becomes a significant fraction of the total computation time. 
In such cases, some kind of iterative tradeoff between chang- 
ing module stage delays, accepting less than maximal output 
rates, and initializing the pipeline may be necessary.


ABSTRACT


In this paper we examine the capabilities and limita- 
tions of Petri nets and investigate techniques for prov- 
ing their correctness. We define different classes of 
nets where each is basically a Petri net with slight 
modifications and study the relationship between the 
various classes. One particular class appears to be 
quite powerful, with respect to its capability for 
representing coordinations. In the second part of the 
paper we establish the feasibility of using the methods 
of computational induction and inductive assertions to 
prove restricted statements about Petri nets. 


I. INTRODUCTION 


Petri nets are being widely used in the design, speci- 
fication and evaluation of computer systems [1,7], and 
in the modeling of production [3] and legal [6] systems. 
They also appear to be a neat, clear and convenient way 
to express process coordination. Naturally, the ques- 
tion about capabilities and limitations of these nets 
arises. It has been shown [4] that there are problems 
where the desired coordination cannot be expressed 
using Petri nets. In the first part of this report we 
introduce different classes of nets. Each class is 
basically a Petri net with slight modifications. We 
then examine the relationship between the various class- 
es in the hope that this will give us some insight into 
the capabilities and limitations of Petri nets. 


In the second part we are concerned with proving asser- 
tions about Petri nets. Given a coordination problem 
and a Petri net it should be possible to convince one- 
self that the Petri net does in fact represent the de- 
sired coordination correctly. Techniques for proving 
any given Petri net correct, will help in proving the 
correctness of general parallel systems since it may
be possible to translate the system mechanically into a
Petri net where it is easier to see what is going on.


II. CAPABILITIES AND LIMITATIONS 


We assume that the reader is familiar with Petri nets 
and concepts such as liveness, safety, etc. However, 
for the sake of avoiding ambiguity we will define a 
Petri net and give the simulation rules explicitly. 


A Petri net N is a directed graph defined as a quadruplet
(T,P,A,M°) where

$T = \{t_1, t_2, \ldots\}$ is a finite set of transitions,

$P = \{p_1, p_2, \ldots\}$ is a finite set of places
(T, P form the nodes of the graph),

$A = \{a_1, a_2, \ldots\}$ is a finite set of directed arcs of the
form (x,y) which either connect a transition to a place or a
place to a transition. Each place may have one or more markers
in it or it may be empty. A place is full if it has at least
one marker.


$M° = \{(p,n) \mid p \in P \text{ and } n \in \{0,1,2, \ldots\}\}$


(a function from P to {0,1,2, ...}) is the initial marking.

Simulation Rules


Given a certain marking M of a net, if all the input
places to a transition are full the transition is said
to be enabled in M. An enabled transition may at some
stage decide to fire. At this stage it reserves a marker
in each input place and starts firing. At the completion
of firing it removes the reserved markers and places a
marker in each output place, giving a new marking M'.
We say that the firing of $t_i$ in M results in M'. As soon
as a marker is reserved it becomes invisible to all other
transitions.
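
The simulation rules above can be sketched as a small Python class. This is a hedged illustration, not the authors' formalism: the class name `PetriNet`, the method names, and the split of a firing into `start` and `complete` are assumptions made here to model the reservation of markers.

```python
from collections import Counter


class PetriNet:
    """Minimal model of the simulation rules: reserve on start, deposit on completion."""

    def __init__(self, inputs, outputs, marking):
        self.inputs = inputs              # transition -> list of input places
        self.outputs = outputs            # transition -> list of output places
        self.marking = Counter(marking)   # place -> number of visible (unreserved) markers
        self.firing = []                  # transitions that have started but not completed

    def enabled(self, t):
        """A transition is enabled if every input place holds a visible marker."""
        return all(self.marking[p] > 0 for p in self.inputs[t])

    def start(self, t):
        """Reserve one marker in each input place; reserved markers become invisible."""
        if not self.enabled(t):
            raise ValueError(f"transition {t} is not enabled")
        for p in self.inputs[t]:
            self.marking[p] -= 1
        self.firing.append(t)

    def complete(self, t):
        """Finish firing: drop the reserved markers and mark each output place."""
        self.firing.remove(t)
        for p in self.outputs[t]:
            self.marking[p] += 1
```

Two transitions that share a single-marker input place show the effect of reservation: once one of them has started firing, the other is no longer enabled.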
$t = t_{b_1} t_{b_2} \cdots t_{b_n} \in T^*$
is said to be a simulation sequence of a net N = (T,P,A,M°)
if there exists a sequence of markings $M^0, M^1, \ldots, M^n$
such that $t_{b_i}$ is enabled in $M^{i-1}$ and firing of
$t_{b_i}$ in $M^{i-1}$ results in $M^i$, for all
$i \in \{1, 2, \ldots, n\}$. The set of all simulation sequences
of N is called the simulation set of N or $SIMSET_N$. Let
$T' \subseteq T$. Then for each simulation sequence
$t = t_{b_1}, \ldots, t_{b_n}$ of N we define a reduced simulation
sequence $t' = t_{c_1} \cdots t_{c_k}$ with respect to T', where
t' is the sequence that results when all $t_i \in T - T'$ are
excluded from t. $SIMSET_N | T'$ is the set of all reduced
simulation sequences of N with respect to T'.


Two Petri nets $N_1 = (T_1, P_1, A_1,$ ...


So far, we assumed that the transitions of Petri nets
had distinct labels. We now define an interpretation
I [T',E] of a Petri net N = (T,P,A,M°) as follows:
$T' = \{t_{a_1}, \ldots, t_{a_m}\} \subseteq T$ is a set of
transitions, $E = \{E_1, \ldots, E_k\}$, $k \le m$, is a set of
event or process names and $I: T' \to E$, i.e.
I is a function from T' onto E. Thus, the same event
or process name may be attached to different transitions
and the same net may represent different coordinations
depending on the interpretation given to it. Given a
net N = (T,P,A,M°) and an interpretation I [T', E], for
each reduced simulation sequence $t_{b_1}, \ldots, t_{b_m}$ with
respect to T' we get an interpreted simulation sequence
$E_{c_1}, \ldots, E_{c_m}$ with respect to I where
$I(t_{b_i}) = E_{c_i}$ for $1 \le i \le m$. The set of all
interpreted sequences of N with respect to I is called
$I[SIMSET_N]$. A net $N_1$ with an interpretation $I_1$ [T', E]
is weakly equivalent to a net $N_2$ with interpretation
$I_2$ [T'', E] if $I_1[SIMSET_{N_1}] = I_2[SIMSET_{N_2}]$ and in
this case we write $N_1 \overset{I_1,I_2}{=} N_2$.


In what follows we will define different classes of
nets where each kind is basically a Petri net with
slight modifications. SIMSET, SIMSET | T and I [SIMSET]
can be appropriately defined for each class. If TN and
$TN_1$ refer to two different classes of nets, then PN and
$PN_1$ refer to all the coordinations representable by TN
and $TN_1$ respectively. We say that $PN \subseteq PN_1$ if for
every $N \in TN$ and interpretation I [T',E] there exists
an $N_1 \in TN_1$ and interpretation $I_1$ [T'',E] such that
$N \overset{I,I_1}{=} N_1$.
Thus $PN \subset PN_1$ if $PN \subseteq PN_1$ and there exists a
net $N_x \in TN_1$ and an interpretation $I_x$ [T',E] such that
there is no net $N \in TN$ and interpretation I [T'',E]
with $N \overset{I,I_x}{=} N_x$.


Classes of Nets


1. Let the class of ordinary Petri nets be TN.

2. The transitions in ordinary Petri nets are enabled
only when all the input places are full and we can consider
these transitions to have an AND-input logic. If
in addition, we allow transitions with OR input logic,
we call the class of nets $TN_{log}$.

[Diagram not reproduced: transition $t_j$ with OR input logic]

(Letting $P_i$ denote the number of markers in $p_i$) $t_j$ is
enabled if and only if
$[(P_1 > 0) \wedge (P_2 > 0)] \wedge [(P_3 > 0) \vee (P_4 > 0)]$.
Thus $t_j$ is enabled even if all the input places do not have
markers and when it starts firing it reserves a marker in each
input place that has at least one.

3. In addition to the ordinary transitions in the nets
belonging to TN we allow a transition to have input
places and arcs of a special kind. The transitions
allowed are of the form:

[Diagram not reproduced: transition $t_j$ with special input places]

$t_j$ is enabled if and only if
$(B_1 = 0) \wedge (B_2 = 0) \wedge \ldots \wedge (P_1 > 0) \wedge (P_2 > 0) \wedge \ldots \wedge (P_n > 0)$.
When $t_j$ starts firing a marker is reserved in each of
$p_1, p_2, p_3, \ldots, p_n$. Let the class of nets be called
$TN_{com}$.

4. In addition to the ordinary places in the nets belonging
to TN we introduce a special place (say $p_j$). A transition
will place a marker in $p_j$ if and only if $P_j = 0$. Let the
class of nets be $TN_{out}$.
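
The enabling conditions of the four classes can be written as predicates over a marking. This is a sketch under assumptions: the marking is modelled as a plain dictionary from place names to marker counts, and the function names are invented here for illustration.

```python
def enabled_and(marking, inputs):
    """Ordinary Petri net (TN): every input place holds at least one marker."""
    return all(marking[p] > 0 for p in inputs)


def enabled_log(marking, and_inputs, or_inputs):
    """TN_log: all AND inputs are full and at least one OR input is full."""
    return enabled_and(marking, and_inputs) and any(marking[p] > 0 for p in or_inputs)


def enabled_com(marking, inputs, zero_inputs):
    """TN_com: ordinary inputs are full and every special input place is empty."""
    return enabled_and(marking, inputs) and all(marking[b] == 0 for b in zero_inputs)


def deposit_out(marking, place):
    """TN_out special place: a marker is deposited only if the place is empty."""
    if marking[place] == 0:
        marking[place] = 1
    return marking
```

The `enabled_log` predicate with AND inputs `p1`, `p2` and OR inputs `p3`, `p4` reproduces the condition given for class 2 above.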


Results


1. Since in each case we provided the nets with additional
capabilities over the nets belonging to TN, obviously:

$PN \subseteq PN_{log}$
$PN \subseteq PN_{out}$
$PN \subseteq PN_{com}$

2. $PN \subset PN_{com}$


Proof


Kosaraju [4] describes a coordination problem and
proves that it falls outside PN. The problem is as
follows: There are four cyclic processes, $P_1$, $P_2$, $C_1$
and $C_2$, and two buffers $B_1$ and $B_2$. $P_1$ and $P_2$ are
producers which place one item each on top of $B_1$ and $B_2$
respectively in every cycle. $C_1$ and $C_2$ consume one
item each from the bottom of $B_1$ and $B_2$ respectively.
However, $C_1$ has higher priority than $C_2$ so that $C_2$ can
consume only if $B_1$ is empty. To prove that $PN \subset PN_{com}$
we will give an interpreted net belonging to $TN_{com}$,
which represents the desired coordination. The net is:

[Diagram not reproduced]


3a. $PN_{log} \subseteq PN_{com}$


For every net $N_{log}$ = (T,P,A,M°) there exists a net
$N_{com}$ = (T', P', A', M'°) $\in TN_{com}$ such that
$T \subseteq T'$ and $N_{log}$ is strongly equivalent to $N_{com}$
with respect to T. The result 3a follows from this.
We will not go into the details of a proof but will
illustrate the idea by means of an example. Let the
net N below be part of a larger net $N_{log}$ = (T,P,A,M°)
belonging to $TN_{log}$:

[Diagram not reproduced]

Let the resulting net be N'. Then obviously N' is strongly
equivalent to $N_{log}$ with respect to T. By applying a similar
procedure to each transition with OR input logic we end up
with a net $N_{com}$ which is strongly equivalent to $N_{log}$
with respect to T.


3b. Kosaraju's problem 1 and proof [4] can be used to
prove that $PN_{log} \ne PN_{com}$.


4a. $PN = PN_{out}$


For every net $N_{out} = (T_1, P_1, A_1, M_1°) \in TN_{out}$ and
interpretation $I_1$ [T',E], there exists a net
$N = (T_2, P_2, A_2, M_2°) \in TN$ and interpretation
$I_2$ [T'',E], such that $N_{out} \overset{I_1,I_2}{=} N$.
Again, we will not go into details of a proof but will
illustrate with an example. Let the net N below be
part of a larger net $N_{out}$:

[Diagram not reproduced]

resulting in the net N'. Here we have introduced a
place $\bar{p}$ which is a complement of p in the sense that
$\bar{p}$ has a marker if and only if p does not. The reader can
convince himself that under the interpretation I'
[T ∪ {...}, E], where $I'(t) = I_1(t)$ for $t \in T$ and
$I'(t_1') = I_1(t_1)$, $N_{out} \overset{I_1,I'}{=} N'$. Continuing
this process until all places of the special form are
eliminated, we end up with a net $N \in TN$ and an
interpretation $I_2$ [T'', E] such that
$N_{out} \overset{I_1,I_2}{=} N$.

This shows that $PN_{out} \subseteq PN$ and from result 1 we
conclude that $PN_{out} = PN$. Since $PN \subset PN_{com}$ we
also conclude that $PN_{out} \subset PN_{com}$.


Comments: We feel that $PN \subset PN_{log}$. We are also
examining other classes of nets. For example, in
addition to the ordinary arcs between transitions and
places we allow the following:

[Diagram not reproduced: transition $t_i$ with a special arc to place $p_j$]

$t_i$ will place a marker in $p_j$ if and only if $P_j > 0$.
Another class of nets is those where we allow a transition
to nondeterministically place a marker in one or
more of its output places. The results obtained so
far indicate that $TN_{com}$ is a very powerful class of
nets.


Safe nets


If one considers only safe nets (where each place can
contain at most one marker at any stage), then it can
be shown that for every $N_{com}$ = (T,P,A,M°) $\in TN_{com}$ that
is safe, there exists a safe net $N \in TN$ such that
$N_{com}$ and N are equivalent. Again, we will only demonstrate
the technique of obtaining N with the help of an example. Let
the net $N_1$ below be part of a safe member $N_{com}$ of
$TN_{com}$:

[Diagram not reproduced]

The fact that $N_{com}$ is safe permits us to introduce
places $\bar{p}_1$, $\bar{p}_2$, $\bar{p}_3$ which are complements
of the places $p_1$, $p_2$ and $p_3$ respectively, i.e. $\bar{p}_i$
has a marker if and only if $p_i$ does not. Every transition
that causes a marker to be put in $p_i$ should cause a marker
to be removed from $\bar{p}_i$. Every transition that causes a
marker to be removed from $p_i$ should cause a marker to be
placed in $\bar{p}_i$. We now have a net N' that is equivalent
to $N_{com}$.

By continuing the process of replacement we end up with
a net $N \in TN$ that is equivalent to $N_{com}$. If
$PN_x | safe$ denotes the set of coordinations representable by
safe members of $TN_x$, then $PN | safe = PN_{com} | safe$. From
results 1 and 4,
$PN | safe = PN_{log} | safe = PN_{out} | safe = PN_{com} | safe$.
Thus, even though $TN_{com}$ is a powerful class of nets,
in practice one would probably be more concerned with
safe nets and here the modifications made to ordinary
Petri nets do not increase the overall power.


TII. CORRECTNESS 


When we say that a "Petri net N is correct", intuitive- 
ly what is meant is that the Petri net does what the 
designer intended it to do. 
a Petri net is constructed which represents the de- 
sired coordination. First and foremost we are not at 
all concerned with whether the Petri net is the best 
one for the given problem. In fact, we will not even 
try to prove that the Petri net effectively represents 
the desired coordination. We shall, however, try to 
prove very restricted statements about a net which are 
provided by the designer. The kinds of statements we 
will attempt to prove are: 


1. At any given time only one of the transitions from 
the set {t], ..., tk} may be firing. 

2. Two given transitions will never conflict. 

3. A given place is safe with respect to a particular 
marking or a given marking is safe. 

4. Agivenplace can contain at most N markers 

5. A given transition is live. 

6. A given marking is reachable from another. 

7. A given transition has fired at most x times. 

8. In general it may be very difficult to show that 

a "net is deadlock free". Again, the designer will 
have to provide statements, for example, "Every trans- 
ition in cycle C is live at every stage", from which 
he can reasonably conclude that the net will not hang 


up. 


In the following we present two methods to prove the 
correctness of Petri nets: Computational induction 
and inductive assertions. 


Computational Induction 


Here we develop certain relations that remain invar- 
dant during the simulation of a net. By using these 
relations suitably we will be able to prove certain 
properties about the net. According to our simula- 
tion rules, when a transition starts firing it re- 
serves a marker in each input place. Reserved markers 
are invisible to all other transitions. However, in 
the invariant relations, all reserved markers are also 
counted and assumed to be intheir current places. The 
relations follow trivially from the simulation rules. 
Let, 


Mi: Number of stores in py initially 
Pi: Number of stores in py at any instant 
Ti: Number of times tj has fired till any instant. 


Relation 1: Let I; = {set of transitions with p,; as 
output place}, 0; = {set of transitions with pj as in- 
put place}, then 


te > SO 
Let t 


Relation 2: be a 


ay Po ao? eee, Por-1 sca, 

path in the net such that tas Poi 1 <i<k are dis- 

tinct. If in addition I}, = {tg,}, 1 < i < k-1 then 
by ay eee 


Given a particular problem, 


we have a simple path and Ta, eee + > M.- 


Relation 3: If $S_1$ is a simple path from $t_i$ to $t_j$ and 
$S_2$ is a simple path from $t_j$ to $t_i$ then $S_1$, $S_2$ forms a 
simple cycle. If in addition every place on a simple 
cycle has only one input and one output arc then we 
have a pure cycle. Let S be a pure cycle then: 

$$\sum_{p_i \text{ in } S} P_i = \sum_{p_i \text{ in } S} M_i = N_S \text{ (say).}$$
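
As a sketch of how Relation 2 can be obtained from Relation 1, the derivation below assumes that Relation 1 has the conservation form $P_i = M_i + \sum_{t_j \in I_i} T_j - \sum_{t_j \in O_i} T_j \ge 0$ (our reading of the relation) and uses only the symbols defined above. 

```latex
% Assumption: Relation 1 reads
%   P_i = M_i + \sum_{t_j \in I_i} T_j - \sum_{t_j \in O_i} T_j \ge 0.
\begin{align*}
P_{b_i} &= M_{b_i} + T_{a_i} - \sum_{t_j \in O_{b_i}} T_j \ge 0
  && \text{(Relation 1 with } I_{b_i} = \{t_{a_i}\}\text{)} \\
T_{a_{i+1}} &\le \sum_{t_j \in O_{b_i}} T_j \le T_{a_i} + M_{b_i}
  && \text{(since } t_{a_{i+1}} \in O_{b_i}\text{)} \\
T_{a_k} &\le T_{a_1} + \sum_{i=1}^{k-1} M_{b_i}
  && \text{(chaining over } i = 1, \dots, k-1\text{)}
\end{align*}
```

The last line is the bound used in the proofs that follow: a transition at the end of a simple path can never have fired more often than the transition at its start plus the markers initially on the path. 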


We have used these relations to prove simple assertions 
about nets, and will illustrate the method by means of 
an example. Consider the producer-consumer problem 
with a bounded buffer. The producer places items in a 
buffer (of length N) and the consumer consumes them. 
The problem is to coordinate these two essentially 
independent processes so that the consumer does not try 
to take an item from the buffer when it is empty and 
the producer does not place an item when the buffer is 
full. The Petri net that represents the described 
coordination is given below (the numbers in the places 
denote the initial number of markers): 


[Figure: Petri net for the producer-consumer problem with bounded buffer, including the deposit and consume transitions.] 


We are interested in proving the following properties 
for this net: 


1. $t_4$ and $t_9$ cannot be firing at the same time, i.e. 
the producer and consumer do not try to access the 
buffer at the same time. 


2. $0 \le T_4 - T_9 \le N$, i.e., there is no buffer overflow 
or underflow. 


3. The net is deadlock free. 


Proof 1: 

$T_8 + T_3 \le 1 + T_{10} + T_5$   (1) By $R_1$ 

$P_4 = T_3 - T_4$   (2) By $R_1$ 

$T_5 \le T_4$   (3) By $R_2$ 

$P_4 \le T_3 - T_5$   (4) from (2) + (3) 


Similarly, $P_{12} \le T_8 - T_{10}$   (5) 

Therefore, $P_4 + P_{12} \le 1$   (6) from (4), (5), (1) 
From (6) and the simulation rules we conclude directly 
that $t_4$ and $t_9$ cannot be firing at the same time. 


Proof 2. 


$T_4 \le T_9 + N$   (1) By $R_2$ 

Therefore, $T_4 - T_9 \le N$, 
i.e., the number of deposits - the number of removals 
$\le N$. Therefore, there can be no buffer overflow. 


$T_9 \le T_4$   (2) By $R_2$ 

Therefore, $T_4 - T_9 \ge 0$, 
i.e., number of deposits - number of removals $\ge 0$. 
Therefore there can be no buffer underflow. 


Proof 3. 


For this particular problem it is easy to see that 
deadlock can occur only if $P_7 = P_9 = 0$ and there is no way 
to change this situation. (Pure cycles can be 
represented by the subscripts of the places only since there 
is no ambiguity.) 


$S_1$ = 1, 2, 3, 4, 5, 6, 7, 1 is a pure cycle 

$S_2$ = 10, 11, 12, 13, 14, 15, 10 is a pure cycle 

$S_3$ = 3, 4, 5, 6, 9, 11, 12, 13, 14, 7, 3 is a pure cycle 

$N_{S_1} = 1$   (1) 

$N_{S_2} = 1$   (2) 

$N_{S_3} = N$   (3) 

$a = P_3 + P_4 + P_5 + P_6 \le 1$   (4) from (1) 

$b = P_{11} + P_{12} + P_{13} + P_{14} \le 1$   (5) from (2) 


Therefore, $a + b \le 2$   from (4), (5) 

But if $N > 2$ and $P_7 = P_9 = 0$ then 

$a + b = N > 2$   from (3) 


Therefore, we get a contradiction. Thus, for N > 2, at 
no stage can both $P_7$ and $P_9$ be zero. Therefore, there 
can be no deadlock for N > 2. For N = 1 and N = 2 
separate arguments can be given to prove that the net is 
deadlock free. 
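
The three properties can also be checked experimentally. The following is a minimal sketch, not the net of the figure: the place and transition names (`empty`, `full`, `mutex`, `p_acquire`, and so on) are hypothetical, and firing is treated as instantaneous rather than having a duration with reserved markers. It fires randomly chosen enabled transitions and asserts mutual exclusion and the buffer bounds after every step. 

```python
import random


def simulate_buffer_net(n_slots, steps, seed=0):
    """Randomly fire a hypothetical bounded-buffer Petri net and check its invariants."""
    rng = random.Random(seed)
    # Initial marking: n_slots empty slots, one mutual-exclusion marker,
    # producer and consumer both ready.
    marking = {"empty": n_slots, "full": 0, "mutex": 1,
               "p_ready": 1, "p_in": 0, "p_done": 0,
               "c_ready": 1, "c_in": 0, "c_done": 0}
    # transition name -> (input places, output places)
    net = {
        "p_acquire": (["p_ready", "empty", "mutex"], ["p_in"]),
        "deposit":   (["p_in"], ["p_done"]),
        "p_release": (["p_done"], ["p_ready", "full", "mutex"]),
        "c_acquire": (["c_ready", "full", "mutex"], ["c_in"]),
        "consume":   (["c_in"], ["c_done"]),
        "c_release": (["c_done"], ["c_ready", "empty", "mutex"]),
    }
    fired = {name: 0 for name in net}
    for _ in range(steps):
        enabled = [name for name, (ins, _) in net.items()
                   if all(marking[p] > 0 for p in ins)]
        if not enabled:
            return fired, marking, False      # no transition can fire: deadlock
        name = rng.choice(enabled)
        ins, outs = net[name]
        for p in ins:
            marking[p] -= 1
        for p in outs:
            marking[p] += 1
        fired[name] += 1
        # Property 1: deposit and consume are never both pending.
        assert marking["p_in"] + marking["c_in"] <= 1
        # Property 2: no buffer overflow or underflow.
        assert 0 <= fired["deposit"] - fired["consume"] <= n_slots
    return fired, marking, True
```

A run such as `simulate_buffer_net(3, 2000)` exercises many firing sequences; unlike the proofs above, it can only fail to find a counterexample, it cannot establish the properties. 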


Inductive Assertions 


This method was introduced by Floyd [2] to prove the 
correctness of sequential programs and the same tech- 
nique was used by Lauer [5] for proving parallel pro- 
grams correct. We have taken the basic ideas from [5] 
and modified them to be applicable in the framework of 
Petri nets. Here again, our aim is to prove that a 
Petri net is correct with respect to a particular given 
assertion A. The procedure is as follows: with each 
transition in the net we associate an assertion. Our 
aim is to prove that every time a transition is enabled, 
the corresponding assertion is true irrespective of the 
particular simulation which caused this transition to 
be enabled and irrespective of the state of the rest of 
the net. Once this has been established, the truth of 
A has to be deduced from the assertions at the transi- 
tions. 


Let N = $(T, P, A, M^0)$ be a Petri net. An assertion $a_i$ 
associated with a transition $t_i \in T$ is a predicate on the 
values of $P_k$ and $T_j$ where $p_k \in P$ and $t_j \in T$. The Petri 
net is correct with respect to the assertion $a_i$ if and 
only if for each simulation of the net that enables $t_i$, 
$a_i$ is true when $t_i$ is enabled. The net N is correct 
with respect to a set of assertions if and only if it is 
correct with respect to each assertion in the set. Let 
$I_i$ = the set of input places of $t_i$ and $O_i$ = the set of 
output places. Then we have the following: 


Induction Theorem 


To prove that a Petri net N = $(T, P, A, M^0)$ is correct 
with respect to a set of assertions $\{a_i \mid t_i \in T\}$ it is 
sufficient to prove the following: 


(1) $a_i$ is true for all $t_i$ that are enabled in $M^0$, 


(2) For each $t_i \in T$, let $P_i = \{p \mid p \in I_i \wedge (p, 0) \in M^0\}$, 
i.e., the set of all initially unmarked input places of 
$t_i$. Let $P_i = \{q_1, q_2, \ldots, q_n\}$. Let $T_j = \{t_k \mid q_j \in O_k\}$, 
$1 \le j \le n$, i.e., the set of all transitions of which $q_j$ 
is an output place. Let $B_i = \{(b_1, b_2, \ldots, b_n) \mid t_{b_j} \in T_j\}$. 
Each n-tuple in $B_i$ gives the set of transitions which 
when fired cause markers to be placed in the initially 
unmarked input places of $t_i$. Let Fire$(b_1, \ldots, b_n)$ 
denote the fact that the transitions $t_{b_1}, \ldots, t_{b_n}$ fire. 
Then for each $t_i \in T$, 

$$a_{b_1} \wedge a_{b_2} \wedge \cdots \wedge a_{b_n} \wedge \mathrm{Fire}(b_1, \ldots, b_n) \Rightarrow a_i$$

for all $(b_1, b_2, \ldots, b_n) \in B_i$   .... (1) 


Proof: Obvious 

Each equation of the form (1) is called a verification 
condition. It should be clear to the reader that the 
verification conditions are really very strong. Thus 
the conditions are not necessary but only sufficient. 
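
To make the set of n-tuples concrete, the sketch below enumerates, for one transition, the tuples of transitions that can mark its initially unmarked input places; each tuple corresponds to one verification condition. The function name and the dictionary-based net encoding are hypothetical choices for illustration. 

```python
from itertools import product


def verification_tuples(transition, inputs, outputs, initial_marking):
    """Return the initially unmarked input places of `transition` and the
    tuples of transitions that can place markers in them.

    inputs:  transition -> list of its input places
    outputs: transition -> list of its output places
    initial_marking: place -> initial number of markers (absent means 0)
    """
    unmarked = [p for p in inputs[transition]
                if initial_marking.get(p, 0) == 0]
    # For each unmarked place q_j, collect the transitions having q_j
    # as an output place.
    producers = [sorted(t for t, outs in outputs.items() if q in outs)
                 for q in unmarked]
    # One tuple per way of choosing a producing transition for every place.
    return unmarked, list(product(*producers))
```

For a transition with two initially unmarked input places, each fed by two transitions, this yields four tuples and therefore four verification conditions, which illustrates how quickly the number of conditions can grow. 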


To prove a net correct, one may often have to construct 
an augmented net. Let $N_1 = (T_1, P_1, A_1, M_1^0)$ be a Petri 
net. Then $N_2 = (T_2, P_2, A_2, M_2^0)$ is an augmentation of 
$N_1$ if and only if $T_1 \subseteq T_2$, $P_1 \subseteq P_2$, $A_1 \subseteq A_2$, $M_1^0 \subseteq M_2^0$ 
and Ty 
Ny = No. 
One can show that, if $N_2$ is an augmentation of $N_1$ then 
$N_2$ is correct with respect to $a_i$ where $t_i \in T_1$ if and 
only if $N_1$ is correct with respect to $a_i'$. Here $a_i'$ is 
the same as $a_i$ with all references to $t \in T_2 - T_1$ and 
$p \in P_2 - P_1$ deleted. 


Thus, to prove that a Petri net N = (T,P,A,M°) is cor- 
rect with respect to an assertion A one goes through 
the following steps: 


1. Formulate the assertion $a_i$ for each transition $t_i$. 

2. Prove that all assertions associated with 
transitions that are initially enabled are true. 

3. Prove that all the pertinent verification conditions 
hold and conclude that N is correct with respect to 
$\{a_i \mid t_i \in T\}$. (Instead of (2) and (3) one may construct 
an augmented net N' of N, associate appropriate 
assertions with the transitions of N', carry out (2) and (3) 
for N' and conclude that N is correct with respect to 
$\{a_i \mid t_i \in T\}$.) 

4. Deduce that the net operates correctly with respect 
to the main overall assertion, A. 


We have used this method to prove the correctness of a 
Petri net representation of the producer-consumer 
problem with respect to an overall assertion. Since the 
assertions and proof are essentially similar to those of 
Lauer [5] we will not present the example here. In a 
subsequent report we will present weaker verification 
conditions, examine whether it is necessary to associate 
assertions with each and every transition and develop 
"local" conditions under which places, arcs and 
transitions can be added to a net N resulting in an 
augmented net N'. 


IV. CONCLUSIONS 


We hope that the discussion in Part II sheds some 
light on the capabilities and limitations of Petri nets. 
TNoom Seems to be a powerful class of nets. It is pos- 
sible that these nets do provide a correct, formal 
counterpart to the vague notion of a "coordination 
problem". We will examine this aspect in another re- 
port. Also, Petri nets seem to be sufficiently power- 
ful if one is concerned only with safe nets. This may 
very well be the case in practice. 


In Part III we have established the feasibility of 
using the methods of computational induction and 
inductive assertions to prove restricted kinds of statements 
about Petri nets. Ultimately, work in this direction 
will facilitate the process of convincing oneself 
that a general concurrent system is correctly 
coordinated. 


ABSTRACT 


FLOWWARE is an interactive, graphical language to aid 
in the understanding and design of digital networks. 
The language is based upon the concept of flow charting. 
The user specifies the register layout of the network 
and the sequential operation in the form of a flow 
chart on a graphics terminal. The flow chart allows a 
user who is unfamiliar with the network to easily 
understand the function and operation of the network. 


I. INTRODUCTION 


Many languages [1-21] exist which aid the user in the 
specification and design of digital networks but they 
do not aid the user who is unfamiliar with the network 
in his attempt to understand the function and operation 
of the network. Flow charting of a program has been 
recognized as an easy method to help in the understand- 
ing as well as the debugging of a program. Therefore, 
since digital networks in many ways resemble a program 
(especially on the register transfer level), flow 
charting should aid the user in the understanding of 
digital networks. 


Graphical languages using a flow charting concept have 
been proposed by Rouse [17] and Bell, et al. [18] but, 
to this date, they have not been implemented on a com- 
puter. Also Digital Equipment Corporation has a modu- 
lar computer, the PDP-16, whose functions can be speci- 
fied by the purchaser through a special purpose lan- 
guage called CHARTWARE [20]. 


FLOWWARE [22-24] is an interactive, graphics language 
which allows the user to define a digital network in a 
manner similar to flow charting. The user can specify 
both the register layout as well as the sequential be- 
havior of the network by using a flow chart and "draw- 
ing" the network on a graphics terminal. Consider a 
simple problem of specifying on a functional level a 
counter to count up from zero to six and then back down 
to zero. Figure 1 is an example of the flow chart 
necessary to describe this system. The elements are a 
three bit register COUNTER and a control signal UP. 

The blocks START and TERMINATE specify, respectively, 
the beginning and end points of the description. The 
reader should recognize that figure 1 represents exactly 
the information that the user would specify on the 
graphics terminal. Each rectangle, arrow, and diamond 
represents an element of the language FLOWWARE. 
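
The counter just described can be paraphrased in a conventional program. The sketch below is only an illustration of the intended behavior, assuming the flow chart counts up while UP is 1, clears UP when COUNTER reaches six, then counts down and terminates at zero; the function name and the `trace` list are hypothetical. 

```python
def up_down_counter(limit=6):
    """Count a three bit register up to `limit` and back down to zero."""
    assert 0 < limit <= 7          # must fit in the three bit register
    up, counter = 1, 0             # :UP=1 and :COUNTER=0 after START
    trace = [counter]
    while True:
        if up:
            counter = (counter + 1) & 0b111   # COUNTER .ADD. 1
            if counter == limit:
                up = 0                        # switch direction at the top
        else:
            counter = (counter - 1) & 0b111   # COUNTER .SUB. 1
        trace.append(counter)
        if not up and counter == 0:
            break                             # TERMINATE
    return trace
```

Calling `up_down_counter()` returns the sequence of COUNTER values visited between START and TERMINATE. 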


Figure 1 also shows some of the characteristics of the 
language. It is a graphical language and hence, gives 
the user a pictorial view of the network. The user 
specifies the register layout and can show the paths 
available for data transfer, the control signals regu- 
lating these transfers, and the functions which modify 
the data. Also, the sequential operation of the network 
is shown in a graphical manner by a flow chart. The 
language is a means of specifying the functional be- 
havior of a system without regard for the technology 
used for hardware implementation. FLOWWARE has been 
developed with the understanding that such problems as 
races, hazards, interconnection layouts, and fault 
analysis are not to be solved with this system. Its 
main purpose is to aid understanding but it can also be 
used in the initial phases of design when ideas are at 
the block diagram and functional level. 


FIGURE 1 


An Up/Down Counter (a) Control Flow Phase 
(b) Register Layout Phase 


[Figure: (a) flow chart from START to TERMINATE containing the statements ":UP=1", ":COUNTER=0", ":COUNTER=COUNTER .ADD. 1", and ":COUNTER=COUNTER .SUB. 1"; (b) register layout showing COUNTER and UP.] 


To assist the user, FLOWWARE is interactive and allows 
simulation of a design. The interactive nature permits 
the user to obtain his results immediately from a 
simulation run. Simulation helps a user understand a 
network. He can change inputs and control signals to 
see what effect they have on the network, if he so 
desires, and resimulate it. In other words, the user 
interacts with a network to understand it or to verify 
its operational correctness. 


FLOWWARE makes use of the IDDAP (Interactive Digital 
Design Assistance Package) system as written by Crall 
[15]. IDDAP is an interactive language which is a sub- 
set of Chu's CDL [10]. As such, it is oriented to text 
input rather than graphical input. Essentially, a pre- 
processor to handle the graphics information was added 
to IDDAP. There were several reasons for using IDDAP, 
one of which was that IDDAP is already an interactive 
system. Also IDDAP has a simulator to allow the de- 
scription to be simulated. This was considered impor- 
tant because it makes it easier to verify that the 
system is working correctly, and also, as already men- 
tioned, a simulator is useful in understanding the 
operation of a network. Finally, IDDAP handles trans- 
lation of text input. In spite of the graphics nature 
of FLOWWARE, it is necessary to describe some operations 
by register transfer statements. Hence in order to 
concentrate on the graphics portion, rather than a 
simulator and text translator, a preprocessor was added 
to IDDAP. The major purpose of the preprocessor is to 
interconnect the graphical elements in the correct 
manner as specified by the user. 


Often, when a person is describing a digital network 
informally, he draws a register layout to give an over- 
all view of the system. Then he inserts the control 
signals and explains the sequential operation of the 
network. FLOWWARE formalizes this process. FLOWWARE 
relieves the user from drawing the elements. It forces 
the user to specify the register layout. It allows the 
user to use a graphical input in the form of a flow 
chart to specify sequential operations. The exact pro- 
cedure and elements to perform these functions will now 
be explained. 


II. FLOWWARE ELEMENTS AND COMMANDS 
This chapter presents the elements and commands of 
FLOWWARE. FLOWWARE has two description phases. Phase 
one is the register layout or information flow phase 
which serves to define the various components and show 
how they are interconnected. This phase is similar to 
the variable declaration statements of most languages 
but has the advantage of giving a pictorial view of the 
system. Phase two is the control flow phase. This 
phase makes use of the definitions in phase one to de- 
scribe the data and control flow of the system by 
specifying a flow chart. The flow chart gives the se- 
quence in which functions and decisions are activated 
along with the control signals regulating the events. 


Table 1 presents the elements and commands as well as 
a brief description of their purpose. Each element is 
drawn by the computer on a graphics terminal at a posi- 
tion specified by the hand movement of a cursor using a 
joystick or mouse. Elements are defined by positioning 
the graphics cursor at the major defining point and 
typing the appropriate command. The major defining 
point is that graphics point which denotes the position 
at which the element is to be drawn by the computer. 
Some elements have a minor defining point because two 
points are necessary to define that element; for exam- 
ple, a line. Most of the elements have some text asso- 
ciated with them. The text defines the additional 
information needed for the element. In many cases this 
is the name by which the element is to be referenced, 
or some function to be performed. Editing features are 
also provided to allow the user to add, change, or 
delete elements or text. 


TABLE 1 
Elements and Commands (a) Phase One 

Meaning                                   Computer Response 
Define Memory                             memory symbol with Input, Address Register, and Output points 
Define Information flow connector 
Define Function 
Define Control signal 
Define control flow Line 
Define dEcodEr 
Define Unary operand function element 
Define Binary operand function element 
Define clock 
Define terminal 
Define Subregister 


TABLE 1 
Elements and Commands (b) Phase Two 

Meaning                                   Computer Response 
Define Function block 
Define Go to line                         arrow 
Define Control signal 
Define control flow Line 
Define Decision block 
Define dEcodE block 
Define start block 
Define end block                          block labeled END 
Define terminate block                    block labeled TERMINATE 


A. PHASE 1 OR INFORMATION FLOW PHASE ELEMENTS 

Phase 1 or information flow phase is used to define the 
components of the digital system to be simulated, to 
describe the register layout of the system, and to show 
the functions to be performed on the data. The basic 
elements of this phase are given in the following 
sections. 


1. Register and Memory 


The register and memory are common elements of a com- 
puter. As such the user can define a register and its 
length as well as a memory with its word length and the 
number of words. Table 1 shows the memory element. 

The input and output points specify those points by 
which the memory is accessed for writing and reading 
respectively. The memory address register, defined as 
a portion of the memory element, specifies the word of 
memory to be accessed. 


2. Information Flow Connector 

The information flow connector is used to connect two 
other elements. It shows the direction that informa- 
tion flows between these elements. Figure 2 is an 
example of the use of an information flow line. Two 
bit registers A and B are defined by the appearance of 
the register symbol, and the information is assumed to 
flow from A to B. The equivalent IDDAP [15] statement 
would be ": B = A." 


FIGURE 2 
Example of Information Flow Connector 


3. Function 


The function element allows the user to define a par- 
ticular set of operations which can be referenced in a 
subroutine-like manner. The operations are defined by 
IDDAP statements. The function is activated, in phase 
2, by the statement : DO name where name is the name of 
the function block. 


4. Control Signal and Control Flow Line 


The control signal and control flow line are used to 
control a data transfer between two elements. The con- 
trol signal defines the name by which the transfer is 
referenced and the control flow line points to the 
transfer to be controlled. Figure 3 shows an example 
of control signal C controlling the transfer A to B. 
The statement : DO C, in phase 2, will cause the trans- 
fer to take place. 


[Figure 3: control signal C controlling the transfer from A to B.] 


5. Decoder 

The decoder decodes an n bit register into one of 2^n 
control signals. Figure 4 shows a decoder DEC which 
decodes the register REGA into control signals. Only 
the decodes of zero and one in REGA are shown in the 
figure. These control signals can be used to control 
other transfers. The control signal CON controls the 
decode. When the statement : DO CON is specified, REGA 
is decoded and the appropriate action takes place based 
upon the current value of REGA and where the control 
flow lines point. 

FIGURE 4 


Decoder with Control Flow Lines 


6. Unary and Binary Operand Function Elements 


The unary and binary operand function elements allow 
the user to perform standard functions on the input 
operands. These operands can be registers or memory. 
The unary operand function element requires one input 
operand and the binary operand function element requires 
two. The result is placed in the register or memory 
location specified as the output operand. The exact 
function to be performed is specified by pointing a 
control flow line at the function name. Tables 2 and 3 
specify the functions available with these elements. 

The use of this feature is best explained by an example. 
Referring to Figure 5, register A is both the input and 
output operand. The control signal is INC, and through 
the control flow lines, it points to the CU portion of 
the unary operand function. The mnemonic CU means 
Count Up. When INC is referenced, the result is to add 
one to the input operand, A, and transfer the result to 
the output operand, A. In effect, register A is incre- 
mented by one. To reference this function, the state- 
ment : DO INC causes the A register to be incremented. 
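
The function mnemonics listed in Tables 2 and 3 can be modeled on fixed-width registers as in the sketch below. This is not the IDDAP implementation: the function name is hypothetical, and it is assumed that results wrap to the register width, that shifts fill with zero, and that operands are treated as unsigned integers. 

```python
def apply_function(mnemonic, width, op1, op2=0):
    """Apply a unary or binary operand function to `width`-bit operands."""
    mask = (1 << width) - 1
    a, b = op1 & mask, op2 & mask
    msb = (a >> (width - 1)) & 1   # bit shifted out by a left circulate
    lsb = a & 1                    # bit shifted out by a right circulate
    functions = {
        # Unary operand functions (Table 2)
        "NOT": lambda: ~a,
        "COM": lambda: -a,                       # two's complement
        "LS":  lambda: a << 1,
        "RS":  lambda: a >> 1,
        "LC":  lambda: (a << 1) | msb,
        "RC":  lambda: (a >> 1) | (lsb << (width - 1)),
        "CU":  lambda: a + 1,
        "CD":  lambda: a - 1,
        # Binary operand functions (Table 3)
        "ADD": lambda: a + b,
        "SUB": lambda: a - b,
        "MUL": lambda: a * b,
        "DIV": lambda: a // b,                   # raises ZeroDivisionError if b == 0
        "REM": lambda: a % b,
        "OR":  lambda: a | b,
        "AND": lambda: a & b,
        "XOR": lambda: a ^ b,
    }
    return functions[mnemonic]() & mask          # keep only `width` bits
```

With this model, the INC example corresponds to `A = apply_function("CU", width, A)`: the input operand and the output operand are the same register. 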


TABLE 2 
Functions of Unary Operand Function Element 


Mnemonic Meaning 
NOT Logical Not 
COM Two's Complement 
LS Left Shift one position 
RS Right Shift one position 
LC Left Circulate one position 
RC Right Circulate one position 
CU Count Up one 
CD Count Down one 


TABLE 3 
Functions of Binary Operand Function Element 


Mnemonic Meaning 
ADD Add operand 1 to operand 2 
SUB Subtract operand 2 from operand 1 
MUL Multiply operand 1 by operand 2 
DIV Divide operand 1 by operand 2 
REM Remainder, operand 1 modulo operand 2 
OR Logical inclusive OR of operand 1 and 
operand 2 
AND Logical AND of operand 1 and operand 2 
XOR Logical exclusive OR of operand 1 and 
operand 2 


FIGURE 5 


Example of a Unary Operand Function 


7. Clock 


Simulation has two modes: clock and no clock. In clock 
mode, update of registers under clock control does not 
occur until there is a clock pulse, i.e., when the 
clock variable changes state. In no clock mode, 
register update takes place immediately [15]. The clock 
element is used to define the register to be used for a 
clock. 
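
The difference between the two modes can be illustrated with a small model. The class below is a hypothetical sketch, not IDDAP code, and it assumes for simplicity that every register is under clock control: in clock mode an assignment is held until the next clock pulse, while in no clock mode it takes effect at once. 

```python
class RegisterFile:
    """Toy register store with deferred (clocked) or immediate updates."""

    def __init__(self, clock_mode):
        self.clock_mode = clock_mode
        self.values = {}     # current register contents
        self.pending = {}    # updates waiting for the next clock pulse

    def assign(self, name, value):
        if self.clock_mode:
            self.pending[name] = value   # visible only after clock_pulse()
        else:
            self.values[name] = value    # no clock mode: immediate update

    def clock_pulse(self):
        """Apply all pending updates at once, as a clock pulse would."""
        self.values.update(self.pending)
        self.pending.clear()
```

Deferring all updates to the pulse means that two transfers issued in the same clock period both read the old register contents, which is the behavior expected of clocked hardware. 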


8. Terminal and Subregister 


The terminal and subregister allow the user to define 
terminals and subregisters. Terminals allow the user 
to refer to a boolean expression by a single name. 
Similarly subregisters allow the user to refer to a 
part of a register by a single name. Therefore the 
text associated with each of these elements is an 
assignment type statement. 


B. PHASE 2 OR CONTROL FLOW PHASE ELEMENTS 


Phase 2 or the control flow phase describes the se- 
quential nature of a digital system in terms of a flow 
chart. There are only nine basic elements needed in 
this phase. Three of these tell the simulator where to 
start and end, and the other six describe functionally 
the operation of the system. 


1. Function Block 


The function block is used to describe a particular 
function which may be one or more IDDAP statements. It 
is the basic element of FLOWWARE phase 2. The dis- 
tinction between this function block and the one in 
phase 1 is the method of activation. The phase 1 func- 
tion requires a subroutine-like call whereas the phase 
2 function block is activated when it is encountered 
during the normal sequence of events as specified by 
the flow chart. In fact, a statement within the phase 
2 function block is necessary to call the phase 1 
function. 


2. Go-To Line 

The go-to line defines the direction of control flow or 
the sequence in which operations are to be executed. 
The order in which elements are executed is determined 
by the direction of the arrow. Also the user can spec- 
ify parallel paths with this element as shown in Figure 
6. Functions B and C are executed in parallel, after 
function A has completed execution. 


FIGURE 6 


Parallel Execution of Function B and Function C 


(Dollar signs $ denote comments within element) 


$FUNCTION A$ 


$FUNCTION B$ $FUNCTION C$ 


3. Control Signal and Control Flow Line 


The use of control signals and control flow lines in 
phase 2 is different from that in phase 1. It is best 
explained by the example shown in Figure 7. Function A 
and Function B are two user defined functions and BACT 
is a control signal on Function B. Function B is exe- 
cuted only after Function A completes execution and if 
BACT is true or a logic 1. If BACT is false, the sys- 
tem "waits" until BACT becomes true. If this is the 
only path in the system and BACT is false, then the 
simulation will be halted without executing Function B. 
If there are parallel paths, Function B is executed 
when BACT is set to logic 1 by one of the other paths. 
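
The waiting behavior can be sketched with a small round-robin scheduler. The representation is hypothetical: each path is a list of steps `(name, guard, sets)`, where `guard` names the control signal that must be 1 before the step runs (or is `None`), and `sets` holds the signal values the step assigns. 

```python
def run_with_waits(paths, signals, max_rounds=100):
    """Execute parallel paths; a guarded step waits until its signal is 1."""
    position = [0] * len(paths)      # index of the next step on each path
    executed = []
    for _ in range(max_rounds):
        progressed = False
        for i, path in enumerate(paths):
            if position[i] == len(path):
                continue             # this path has finished
            name, guard, sets = path[position[i]]
            if guard is None or signals.get(guard, 0) == 1:
                executed.append(name)
                signals.update(sets)
                position[i] += 1
                progressed = True
        if not progressed:
            break                    # every remaining path is waiting: halt
    return executed
```

With a single path, Function B guarded by BACT never runs while BACT stays 0; adding a parallel path that sets BACT to 1 releases it on a later round. 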


FIGURE 7 


Example of the Use of a Control Signal 


(Dollar signs $ denote comments within element) 


$FUNCTION A$ 


$FUNCTION B$ 


4. Decision Block 


The decision block is used to decide between two 
alternate paths. The decision block tests either a boolean 
expression or a relationship expression, such as A > B, 
associated with the element or a control signal 
pointing to the element. When the expression or control 
signal evaluates to a logical one, then the exit point 
is either the top or bottom point of the diamond. If 
it evaluates to a logical zero, then the exit point is 
the right or left point. The exit point is the path to 
be taken when the decision is made. 


5. Decode Block 

The decode block decodes the register defined within 
the block into one of 2^n signals where n is the length 
in bits of the register. The go-to lines leaving this 
element point to the next path to be taken based upon 
the value decoded. Each go-to line has associated 
with it a number which represents the decode of the 
block. At most one go-to line may carry no number; it 
means "take this path" for any decoded value not 
specifically mentioned on the other lines. 


Figure 8 shows an example of the use of the decode 
block. When Function A completes execution, the re- 
gister INST is decoded. If INST = 0 then Function B is 
executed. If INST = 1 then Function C is executed. If 
anything else, Function D is executed. 
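
The selection rule of the decode block amounts to a lookup with a default. In the sketch below the function name and the dictionary of numbered go-to lines are hypothetical, and the register width is passed explicitly. 

```python
def decode_next(value, width, numbered_lines, default_line=None):
    """Pick the next block from the decoded value of a `width`-bit register."""
    value &= (1 << width) - 1            # one of 2**width possible decodes
    if value in numbered_lines:
        return numbered_lines[value]     # go-to line carrying this number
    if default_line is None:
        raise ValueError("no go-to line for decoded value %d" % value)
    return default_line                  # the single unnumbered go-to line
```

For the example of Figure 8, assuming a three bit INST register, `decode_next(INST, 3, {0: "B", 1: "C"}, "D")` names the function executed after Function A. 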


FIGURE 8 


Example of the Use of the Decode Block 
(Dollar signs $ denote comments) 


$FUNCTION A$ 


$FUNCTION D$ $FUNCTION B$ 


$FUNCTION C$ 


6. Start, End, and Terminate Blocks 

These three elements control the simulation. The simu- 
lation starts at the start block and ends at the termi- 
nate block. The end block is used to specify the end 
of a path. When it is encountered, the simulation is 
not halted but any other parallel paths are executed. 
The terminate block halts the simulation when it is 
executed even if there are unexecuted parallel paths. 
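
The difference between the end block and the terminate block can be seen in a toy scheduler. This is a sketch under assumptions: paths are hypothetical lists of block names, each ending in "END" or "TERMINATE", and parallel paths advance one block per round. 

```python
def run_paths(paths):
    """Run parallel paths until all have ended or one reaches TERMINATE."""
    active = list(range(len(paths)))
    position = [0] * len(paths)
    log = []
    while active:
        for i in list(active):           # copy: paths may end during the round
            block = paths[i][position[i]]
            position[i] += 1
            if block == "TERMINATE":
                return log               # halt even if other paths are unfinished
            if block == "END":
                active.remove(i)         # only this path stops
            else:
                log.append(block)
    return log
```

In the first case below the path that reaches END simply drops out while the other continues; in the second, TERMINATE cuts off blocks C and D of the parallel path. 

```python
run_paths([["A", "END"], ["B", "C", "TERMINATE"]])
run_paths([["A", "TERMINATE"], ["B", "C", "D", "END"]])
```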


III. USE OF FLOWWARE 
This section will present some simple examples using 
FLOWWARE. The major emphasis will be on the input 
language rather than the output of the simulator. As 
already mentioned, Figure 1 is an example of an up/down 
counter. 

Consider the digital network shown in Figure 9. The 
problem is to add register A to B or to subtract B from 
A. In both cases the result is to be transferred to 
register A. The control signal C is to be used to 
determine whether addition or subtraction is to be per- 
formed. If C is true, perform the addition. Figure 
9(a) shows the register layout with signals AD and SB 
controlling the addition and subtraction respectively. 
Figure 9(b) shows the control flow where the signal C 
is tested. 
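
The behavior specified by this network can be summarized in a few lines. The sketch below is only a functional paraphrase: the function name is hypothetical, and the register width of eight bits is an assumption, since the text does not state one. 

```python
def add_sub_network(a, b, c, width=8):
    """Return the new contents of register A for control signal C."""
    mask = (1 << width) - 1
    if c:
        return (a + b) & mask    # C true: signal AD, A = A .ADD. B
    return (a - b) & mask        # C false: signal SB, A = A .SUB. B
```

The decision on C corresponds to the decision block of the control flow phase, and the two branches correspond to activating AD or SB in the register layout. 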


FIGURE 9 


Addition/Subtraction Network 
(a) Register Layout Phase (b) Control Flow Phase 


In Figure 10, the fetch and execution of a LOAD 
ACCUMULATOR (register A) instruction for a small computer 
is shown. It is assumed that direct addressing is used 
and that the six bit computer word is divided in half 
with three bits for the operation code and three bits 
for the address. Register P is the program counter, R 
is the instruction decode register, RI is the operation 
code subregister, T is the decode of the clock counter 
K, and Q is the decode of a control register RQ. When 
RQ = 1, the computer is in the instruction fetch cycle. 
When RQ = 2, it is in the instruction execution cycle. 
The operation code for the LOAD ACCUMULATOR is zero. 
The register layout is shown in Figure 10(a) and the 
control flow is shown in Figure 10(b). Notice the use 
of the decode block in both phases as well as the use 
of the control signals. 


FIGURE 10 


Fetch and Execution of an Instruction 

[Figure: (a) register layout; (b) Phase 2 or Control Flow Phase, including the statement ":K=K.COUNT.1".] 
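
The fetch and execution of LOAD ACCUMULATOR can be paraphrased as below. This is a sketch under assumptions rather than a transcription of Figure 10: the function name is hypothetical, the operation code is assumed to occupy the high three bits of the six bit word, the memory is assumed to hold eight words, and the clock counter K and its decode T are left out. 

```python
def fetch_and_execute_load(memory, p):
    """Fetch the instruction at address P and execute it if it is LOAD ACCUMULATOR."""
    # Instruction fetch cycle (RQ = 1): read the word and advance P.
    r = memory[p] & 0o77                 # six bit instruction word into R
    p = (p + 1) & 0o7                    # three bit program counter
    opcode, address = r >> 3, r & 0o7    # assumed split: high bits are the opcode
    # Instruction execution cycle (RQ = 2).
    if opcode != 0:
        raise NotImplementedError("only LOAD ACCUMULATOR (opcode 0) is modeled")
    a = memory[address] & 0o77           # direct addressing: A = M[address]
    return a, p
```

The two commented sections correspond to the two values of the control register RQ described above. 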


IV. CONCLUSION 


FLOWWARE has been implemented on the computer system at 
the University of Missouri-Rolla. This computer system 
consists of the IBM System/360 Model 50 Computer and 
several Data General Corporation NOVA-800 Minicomputers. 
The graphics terminal used as the main input/output de- 
vice is the T4002 Tektronix Graphic Computer Terminal. 
FLOWWARE essentially consists of two programs: one 
written in PL/1 for the System/360 computer and one 
written in assembler for the NOVA computer. The mini- 
computer is responsible for drawing the elements and 
local editing functions, and the main computer is 
responsible for translation and simulation of the 
description. 


FLOWWARE has been designed for user convenience. The 
user is relieved of the burdens of drawing the elements 
and typing long command lines when single letters will 
suffice. The interactive nature of FLOWWARE permits 
the user to obtain his results immediately. All ele- 
ments and all interconnections between elements are 
clearly visible. Simulation allows the user to see the 
network "work" under a variety of input conditions. 
Text and element editing permits modifications to the 
description. The information flow phase description 
allows the user to specify graphically a network of 
registers, etc., which resembles a subroutine and is 
executed like a subroutine in the control flow phase. 


By means of its implementation on the UMR computer 
system, FLOWWARE has shown itself to be a useful tool 
in the process of digital design. FLOWWARE has the 
flexibility needed to meet a diversity of user demands 
while still retaining the structural ordering necessary 
to insure logical consistency within any one 
description. Its similarity to flow charting, its pictorial 
nature, its ease of use, and its interactive qualities 
combine to produce a language which solves the problems 
present in most text oriented languages, the problems 
of comprehension and readability. 


This project is supported, in part, by NSF Grant 
GK34076. At present, work is being done on FLOWWARE 
to improve and expand its capabilities. 


ABSTRACT. A Design Automation System for the RT level of 
design is described. The System explores the design space by 
finding alternative implementations for a user given behavioral 
specification. The alternative solutions are obtained by 
transformations on a graph model. These transformations effect 
trade-offs between the cost of the hardware and the speed of 
the algorithm. Heuristic routines are used to reduce the design 
space by exploring only those alternatives whose characteristics 
approach a user given set of goals. 


1. INTRODUCTION 

A computer system is composed of thousands of 
interconnected components. The basic components of computer 
systems have gone through an evolution from relays, to vacuum 
tubes, to transistors, to logic gates (small scale integration), to 
registers (medium scale integration), and to memories and 
processors (large scale integration). As the basic components 
increased in logical power more complex computer systems 
became feasible. 


The construction of these computer systems has been 
simplified by computer aided design. Early attempts at design 
automation were directed towards a reduction in cost and time of 
the design process itself [1]. These objectives were 
accomplished by relieving engineers of repetitive time consuming 
tasks. This approach to design automation limits itself to filling 
the gap between the low level design specifications and the 
manufacturing data. The inputs to the systems are, generally, in 
terms of Boolean equations which the system then translates into 
an equivalent gate level specification. The Boolean equations 
specify the desired behavior of the finished object. Most of the 
synthesis algorithms at this level deal with the problem of 
reduction or simplification of the Boolean equations. 


Recent efforts at design automation have been directed 
towards a system capable of accepting a high level description 
and translating it into an equivalent gate level structure. 
APDL[2] and ALERT[3] are two such systems. 


The essential feature lacking in these existing systems is the 
exploitation of alternative implementations derived from the initial 
behavioral specifications. This paper deals with the description 
of an automatic design system that explores the design space for 
the register transfer level. The Register Transfer (RT) level [4] 
is characterized by the following basic components: Registers, 
register transfers, and transformations on the contents of 
registers. When completed, the system will take as inputs the 
specification of the desired behavior in some high level RT 
language and the specifications of the hardware RT level 
components. The output is the specification of the hardware
which attempts to optimize the system along some specified
dimensions of the design space. We will restrict ourselves to the
cost and time dimensions. Thus a designer specifies design
constraints to the system, such as whether the solution should be
the cheapest, the fastest, or some trade-off between cost and
speed.


The research in this paper was supported by National Science
Foundation Grant GJ 32758x.


Figure 1. An RT level design automation system
(inputs: design constraints; RT level description of algorithm (ISP))


The automatic design system is depicted schematically in Fig.
1. The description of the algorithm is given in the RT language
ISP [4] and translated into a graph representation. The user
can, however, bypass this step and provide the input (the graph)
directly to the system in an assembly-like notation. This can
be used to design systems not describable in ISP. Subsequently,
various transforms on the graph are attempted to establish a new
solution to the problem. A set of heuristics guides this exploration
of the design space by using the given design constraints to
decide which solutions should be kept to generate other solutions
by yet another application of the graph transformations.


Which set of transforms to apply is determined by the PMS
(Processor, Memory, Switch) type [4] of the modules. The set
of transforms is general and can be used with any set of
modules which conform to a particular PMS type. Module-dependent
transforms, which depend on the details of a certain
module set, such as the cost/performance ratio between two
modules of the same type, are not included in the general
(module-independent) set of transformations, although it is a
simple task for a designer to add extra transforms to the set.


The following sections describe different portions of the 
system. Sections 2 and 3 describe the system inputs, the module 
set and the initial description of the algorithm to be implemented. 
Section 4 delineates the PMS types which are used to select the 
set of transforms discussed in section 5. The cost or gain 
achieved by applying a transform is treated in section 6, while 
the heuristics which drive the design process are presented in 
section 7. Finally, an example problem is given in section 8. 
The various sections will be treated by way of examples. The 
complete details can be found in [5]. 

2. A MODULE SET 

To lend credibility to the discussion of the system, a 
commercially available [6] set of RT level modules, called Register 
Transfer Modules (RTMs), will be used as the module set. 


The following paragraphs briefly introduce the modules and
discuss the design process using them. A more detailed
description is given in [7]. The flowchart format of the RTM
notation is so transparent, however, that the detailed reference
probably need not be read to understand this one.


Figure 2. An RTM multiplier
(flowchart with a control part and a data part on the RTM bus)


Figure 3. Short hand notation of the RTM multiplier
(control: entry, Kev(C←8), Kbr(P<0>), Kev(P←(P+MPD)/2), Kev(C←C-1), Kbr(C=0); data: Mc(8), DMgpa(C), DMgpa(P,MPD), Kbus)


The RTM set consists of about 35 module types falling into
four classes. Each RTM system is built around a common bus for
facilitating data transfers among the registers of the modules
connected to it. The three types of modules that connect to the
bus are: M's, memories for holding single bits (Boolean), or 8-,
12-, or 16-bit integers, and arrays for holding vectors of
integers; T's, transducers for interfacing with the environment
external to RTM (e.g. lights and switches, analog-digital
converters, serial interfaces for teletypes); and DM's,
Data-Memory components to hold data and carry out logical and
arithmetic operations on this data. A fourth type of module, the
K-type, controls the operations in the other three. A network
of K modules is isomorphic to the flowchart of the computational
algorithm that is to be performed, and each individual K module
evokes some operation(s) in the data part of the system
(centered around the bus). The bus has timing interlock signals
to interlock data transfer operations evoked by the K modules.
Multiple buses can be used to increase the performance of a
system.


Figure 2 depicts the RTM implementation of an 8-bit shift
and add multiplier and Fig. 3 the short hand notation for the
system. The multiplier is in the P register and the multiplicand is
in the MPD register and is assumed to occupy the leftmost 8 bits
of the register. The product will be in the P register. The
partial products are formed in the left hand side of the P register
and shifted to their appropriate position in the final product after
eight traversals of the loop. The multiplier will be used as an
example for the following discussion.


Multiplier := (C←8; next
    Loop := ( (P<0> => P←(P+MPD)/2);
              (¬P<0> => P←P/2); next
              C←C-1; next
              (C≠0 => Loop) )
    );


Figure 4. The ISP description of the RTM multiplier


Figure 5. The graph model of the multiplier
(nodes: entry; C←8; Loop; P<0> test; P←(P+MPD)/2; C←C-1)

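
The loop of Fig. 4 can be traced in software. The following is a minimal sketch, assuming unsigned operands and a P register wide enough to keep the carry out of P+MPD; the function name and the `bits` parameter are illustrative and are not part of the RTM notation.

```python
def rtm_multiply(multiplier: int, multiplicand: int, bits: int = 8) -> int:
    """Shift-and-add multiply that mirrors the ISP loop of Fig. 4."""
    limit = 1 << bits
    if not (0 <= multiplier < limit and 0 <= multiplicand < limit):
        raise ValueError("operands must be unsigned and fit in `bits` bits")

    p = multiplier              # P: multiplier in the rightmost bits
    mpd = multiplicand << bits  # MPD: multiplicand in the leftmost bits
    c = bits                    # C <- 8

    while True:
        if p & 1:               # P<0> => P <- (P+MPD)/2
            # Assumption: the carry of the addition is kept before the shift.
            p = (p + mpd) >> 1
        else:                   # not P<0> => P <- P/2
            p >>= 1
        c -= 1                  # C <- C-1
        if c == 0:              # (C != 0 => Loop)
            break
    return p                    # the product is left in P
```

Each pass consumes one multiplier bit from the right of P while the partial product grows in from the left, which is why the product ends up correctly positioned after eight passes.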
3. THE GRAPH MODEL 

There are five basic types of operations in the graph model
the design automation system uses:
- branch (Kb), activates one of the output paths depending on
  Boolean conditions
- serial merge (Ksm), activates its output path when any of the
  input signals arrive
- diverge (Kdiv), activates concurrently all paths attached to it
- parallel merge (Kpm), activates its output path when all its
  input signals have arrived
- data operations (other)


The translation process from the input RT language 
description (ISP) to the graph model is straightforward and has 
been programmed. The ISP for the multiplier is shown in Fig. 4 
and the corresponding graph model is depicted in Fig. 5. 


The system as implemented treats each node in the graph as
composed of a nonempty sequence of the five operation types.
The only restriction is that nodes must have a unique entry
operation (Ksm, Kpm, or data operation) and a unique exit
operation (Kb, Kdiv, or data operation). In the examples that
follow, we will explicitly show the control operations by drawing
them outside their nodes.
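
The node restriction can be stated as a small check. This is a sketch only: it assumes that "unique entry" and "unique exit" mean the first and last operation of the sequence, and the names `Op` and `is_valid_node` are hypothetical rather than taken from the implemented system.

```python
from enum import Enum


class Op(Enum):
    KB = "branch"
    KSM = "serial merge"
    KDIV = "diverge"
    KPM = "parallel merge"
    DATA = "data operation"


# Operation types allowed at the entry and at the exit of a node.
ENTRY_OPS = {Op.KSM, Op.KPM, Op.DATA}
EXIT_OPS = {Op.KB, Op.KDIV, Op.DATA}


def is_valid_node(ops) -> bool:
    """A node is a nonempty sequence with a legal first and last operation."""
    ops = list(ops)
    return bool(ops) and ops[0] in ENTRY_OPS and ops[-1] in EXIT_OPS
```

For example, a node made of a serial merge, a data operation, and a branch passes the check, while a node that begins with a branch does not.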


Figure 6. PMS types


4. PMS MODULE TYPES 
The decision as to which set of transformations should be 
used is determined by which PMS types the module set can 
emulate. 


In model A (Fig. 6.a), each process communicates directly 
with a single large main memory. The important feature is that 
each process can modify information which is to be used by the 
others. 


In model B (Fig. 6.b), slave memories (buffers) have been
added to the system. A process can fetch information from main
memory, but any information to be stored is put in its buffer.
The buffer acts as intermediate storage between the process and
the main memory. When a process needs some information it
looks first in the associated buffer to see if the information has
been stored there as a result of a previous computation. If not,
the data is obtained directly from main memory. When both
processes have completed their tasks, the information in the slave
memories is transferred to the appropriate locations in main
memory.


Model C (Fig. 6.c) differs from model B in that only one
slave memory is used. One of the processes (Pc1) can fetch and
modify data directly in the main memory. The other process
(Pc2) can only fetch data from main memory and uses Mp2 as a
buffer for partial computations.
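
The behavior of a slave memory in model B can be sketched as follows. The class and method names are hypothetical, and main memory is modeled as a plain dictionary purely for illustration.

```python
class BufferedProcess:
    """A process with a private slave memory in front of shared main memory."""

    def __init__(self, main_memory: dict):
        self.main = main_memory   # shared main memory
        self.buffer = {}          # slave memory private to this process

    def fetch(self, name):
        # Look in the buffer first; fall back to main memory.
        if name in self.buffer:
            return self.buffer[name]
        return self.main[name]

    def store(self, name, value):
        # Stores never reach main memory directly.
        self.buffer[name] = value

    def commit(self):
        # After the processes finish, buffered values go back to main memory.
        self.main.update(self.buffer)
        self.buffer.clear()
```

Under model C, in this sketch, one process would read and write `main` directly while only the other would use a `BufferedProcess`.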


From these models the various conditions on the variables 
for parallel processing can be developed [8,5]. RTM's 
correspond to either model B or C since a process occupies a bus 
and two busses cannot share data without co-operation between 
processes. 

5. THE TRANSFORMS 

The set of transforms for RTM's will be demonstrated by 
example. The full set of transforms is described elsewhere [5]. 
In general, speed is achieved (at some extra cost) by increasing 
parallelism. Cost is decreased by reducing parallelism. For 
purposes of example, suppose that we want to increase speed.


Figure 8. The parallel computation of nodes D,E and B,C1,C2


Consider the multiplier in Fig. 5. The graph model is first 
cleaned up by removing no-operation control nodes (Kpm with
a single input for example) which were introduced by the ISP to 
graph model translation. 


Associated with each node is a, possibly empty, set of
variables which indicates which variables are used and/or
modified by the operation(s). Node D (C←C-1) depends on
variable C alone while nodes B, C1, and C2 depend on P and
MPD. Hence node D can be computed in parallel with nodes B,
C1, and C2 since they depend on different sets of variables.
This is depicted by the transformed graph in Fig. 7. Note
further that node E also depends only on variable C. Hence E
could be performed in parallel with B, C1, and C2, but it must
follow the computation of D, as shown in Fig. 8.


Sometimes one node may use a variable while another uses
and modifies the same variable. The first node can be computed
in parallel with the second if the first node receives its own copy
of the variable before the parallel computations start. Copying
the variable takes time and requires extra hardware. By defining
the various ways variables are used it is possible to determine if
a transformation can be applied and how much will be saved or
lost in terms of time and cost as shown in the next section.
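
The two cases above (disjoint variable sets, and a variable used by the first node but modified by the second) can be sketched as a feasibility test. The names are hypothetical, and the rule that a variable modified by the first node and touched by the second forces sequencing is an assumption of this sketch; the complete conditions are those of [5,8].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    used: frozenset = frozenset()
    modified: frozenset = frozenset()


def parallel_check(first: Node, second: Node):
    """Return (feasible, copies) for two nodes given in sequential order.

    `copies` is the set of variables the first node must receive as
    private copies before the parallel computation starts.
    """
    # Assumption: anything `first` modifies that `second` touches keeps the order.
    if first.modified & (second.used | second.modified):
        return False, frozenset()
    # Variables `first` only reads but `second` modifies need a private copy.
    return True, first.used & second.modified
```

With C1 using P and MPD and modifying P, and D using and modifying C, the check reports that C1 and D can run in parallel with no copies, while D followed by the test E on C must stay sequential.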


The transforms are of a general nature in that they apply not 
only to individual nodes but to subgraphs of arbitrary complexity. 
Each subgraph is also characterized by the variables used and/or 
modified by its computations. Methods for forming these 
subgraphs and their associated variable sets have been automated 
[5] but will not be described here. 


6. DESIGN SPACE TRADE-OFFS 
Two parameters will be used to describe the design space: 
The cost of the hardware involved and the operational time. The 
former is obtained by adding the costs of the components used in 
both the data and control structures. The latter is obtained from 
the average speed of the operations involved. 


For a straight sequence of operations the time required is 
the sum of the individual times, Fig. 9.a. In the presence of 
concurrent activities, the operation time is that of the longest 
(timewise) sequence, Fig. 9.b. When alternative sequences are 
initiated as a result of a data dependent decision, the time
required for the execution is not known a priori. In this instance 
a worst case situation will be assumed, namely, that the longest 
path is the one selected, Fig. 9.c. 


(a) Sequence:               T(A)+T(B)
(b) Concurrent sequences:   Max(T(A),T(B))
(c) Alternative sequences:  Max(T(A),T(B))
(d) Cycles:                 N×T(A)


Figure 9. Time estimation


The presence of cycles (loops) adds some complexity to the
estimation of the operation time. In this case the level of nesting
is assumed to be proportional to the frequency of execution of
the operations. Conceptually this is equivalent to replacing the
cycle by a sequence of multiple copies of the individual
operations. Since the number of times a loop is executed (i.e.
the number of copies) is usually unknown, a default (2) is
assumed. This default may be overruled by the designer by
specifying an estimated loop count, Fig. 9.d.
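
The four estimation rules of Fig. 9 combine recursively. The sketch below assumes a structured description of the graph (sequence, concurrent, alternative, cycle); the class names are hypothetical and do not reflect the internal representation used by the system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    time: float


@dataclass(frozen=True)
class Seq:
    parts: tuple


@dataclass(frozen=True)
class Par:
    parts: tuple


@dataclass(frozen=True)
class Alt:
    parts: tuple


@dataclass(frozen=True)
class Loop:
    body: object
    count: int = 2  # default loop count when the designer gives no estimate


def estimate_time(g) -> float:
    if isinstance(g, Operation):
        return g.time
    if isinstance(g, Seq):            # (a) sum of the individual times
        return sum(estimate_time(p) for p in g.parts)
    if isinstance(g, (Par, Alt)):     # (b), (c) longest path, worst case
        return max(estimate_time(p) for p in g.parts)
    if isinstance(g, Loop):           # (d) N copies of the body
        return g.count * estimate_time(g.body)
    raise TypeError(f"unknown graph element: {g!r}")
```

For instance, a loop body made of an alternative (3 or 2 time units) followed by two unit-time operations, with a designer estimate of 8 passes, is estimated at 8 × (3 + 1 + 1) = 40 time units.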


Having defined the parameters of the design space we can 
now describe the trade-offs involved in the transformation rules. 
Connectivity and data dependency are used in the system to 
indicate the feasibility of a transformation. Feasible 
transformations, however, do not imply necessarily any 
advantage in their application, and the desirability of such a 
transformation is indicated by a different set of conditions. 


Fig. 10 shows the effect of one of the transformations, rule
SP. Node X1 is required to copy to local memory those variables
used by node (subgraph) B in its computation according to PMS
types B and C. Likewise the two X2 nodes are required so that
all the variables transformed by nodes A and B are available to
any of the n paths originally following node B. The trade-offs
are:


TIME:  original  T(A)+T(B)
       new       T(X1)+T(X2)+Max(T(A),T(B))
       gain      T(A)+T(B)-T(X1)-T(X2)-Max(T(A),T(B))

COST:  original  C(A)+C(B)
       new       C(X1)+n.C(X2)+C(A)+C(B)+α.C(Bus)
       extra     C(X1)+n.C(X2)+α.C(Bus)

where α = 0 or 1, depending on the availability (e.g. idle) in the
current version of the system of a bus that could be used by B.


Figure 10. Rule Serial to Parallel (SP)


In the case of rule SP the concurrent computation of A and B
may not bring about a reduction in time: The transfer operations
X1 and X2, used to load and unload variables to and from the
different busses, take a non-zero amount of time. If the number
of variables transferred is large, this overhead may cancel any
gains obtained from the concurrent computation of A and B. The
bus required to execute B may or may not be already present in
the system. If it is available (α=0) then it can be shared at no
extra cost.
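
The SP trade-off can be evaluated directly from the expressions above. The following sketch uses hypothetical parameter names; `bus_idle` stands for the case α=0.

```python
def sp_tradeoff(t_a, t_b, t_x1, t_x2, c_x1, c_x2, c_bus, n, bus_idle):
    """Return (time gain, extra cost) of applying rule SP to nodes A and B."""
    alpha = 0 if bus_idle else 1
    original_time = t_a + t_b
    new_time = t_x1 + t_x2 + max(t_a, t_b)
    time_gain = original_time - new_time
    extra_cost = c_x1 + n * c_x2 + alpha * c_bus
    return time_gain, extra_cost
```

Since T(A)+T(B)-Max(T(A),T(B)) is the shorter of the two times, the gain is positive only when the shorter computation takes longer than the two transfers X1 and X2 together.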


Desirability conditions for other transformations are 
described in [5]. They can be used to eliminate those (feasible) 
transformations where they do not produce the desired savings 
(in cost or time) or where the gain is below a designer specified 
threshold. 

7. HEURISTICS 

Due to the interaction between transformations it is a difficult
task to formalize the optimization (improvement of alternative
structures) as a mathematical optimization problem. The main
difficulty is the fact that transformations apply to subgraphs of
arbitrary size, and as a consequence transformations in a given
alternative structure may or may not be feasible or desirable in
structures derived from it. It is also the case that new cases of
transformations become feasible or desirable only after a specific
sequence of transformations has been applied.


The design space is represented by a time/cost diagram.
Alternative structures are represented by points in the diagram.
Except for the original solution, all points are derived, by
transformations, from other points in the space. These
relationships will be made explicit by drawing vectors from the
parent nodes to their immediate (i.e. one transform removed)
descendants.


The exploration of the design space in our system is
performed by a group of heuristic routines that produce
alternative designs in a goal oriented fashion, the goal being
specified by the designer. Ideally, the goal is to find an
alternative structure whose position in the design space is as
close as possible to the origin (0 cost and 0 time). This ideal
case is, however, not easily found in real solutions. The usual
case is that the least expensive solution is not the fastest and
vice versa. This characteristic provides a rough classification of
the design objectives into two classes: minimal cost and minimal
time.


Although a designer's aim can be classified according to 
these objective functions it may be the case that the real 
objective is more complicated in nature, namely, some 
combination of time and cost. For instance, the objective could 
be something like: "the fastest alternative structure not costing
more than x dollars".


For simplicity, the subspace of acceptable solutions will be
defined by a set of straight line segments whose slopes reflect
the objective functions. In the example above a single straight
line, parallel to the cost axis, would be used to divide the space in
two halves. Only those solutions that lie in the semispace
containing the origin are considered acceptable. These solutions
represent improvements along the design goal.


Figure 11. Design space reduction


More complex constraints can be described by using lines of
the form $ = -m.T + b, where m is a parameter indicating how
many dollars the designer is willing to pay for each time unit
saved (if time is the primary goal) or how many time units the
designer is willing to sacrifice for each dollar saved (if cost is the
objective). An example, Fig. 11, will clarify this description.


Assume that the primary objective is a reduction in time, and 
that the designer wants a time/cost trade-off of at most m 
dollars for each time unit improvement. Furthermore, assume 
that the original design is characterized by $1 and T1. The
"acceptable trade-off" subspace would thus be delineated by
two line segments: one parallel to the cost axis starting from 
(T1,$1) to (T1,0), and the other through (T1,$1) with slope —m. 
By studying the control flow and data dependencies in this 
original structure, four transformations are available which yield 
four alternative solutions derived from the original one: A,B,C,D. 


By dividing the space according to the trade-off lines,
alternatives B, C, and D can be rejected because their
characteristics are not within the acceptable subspace (i.e. they
take more time or the decrease in time costs too much). The
alternative left, A, represents improvement in time while the cost
to achieve the improvement is under the designer's threshold.
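
The acceptance test implied by the two line segments can be written as a pair of inequalities. This is a sketch under stated assumptions: designs are (time, cost) pairs, points lying exactly on a boundary are accepted, and the function name is hypothetical.

```python
def acceptable(parent, candidate, m):
    """True if `candidate` lies in the acceptable trade-off subspace of `parent`.

    parent, candidate: (time, cost) pairs.
    m: dollars the designer will pay per time unit saved.
    """
    t1, c1 = parent
    t, c = candidate
    no_slower = t <= t1                      # left of the line T = T1
    within_slope = c <= c1 + m * (t1 - t)    # below the line of slope -m
    return no_slower and within_slope
```

With a parent at (10, 100) and m = 3, a candidate at (8, 105) is accepted because two time units were bought for five dollars, while (8, 110) and (12, 90) are rejected.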


Figure 12. Controller for a conveyor-bin system


The process can now be applied to A in an identical manner.
Design A is taken as the new initial solution and a new
"acceptable trade-off" subspace is defined by a line segment
from (T2,$2) to (T2,0) and a line with slope -m through (T2,$2).
Since in some cases more than one alternative can be left for
further exploration, this process takes the form of a tree walk
where the nodes represent alternative solutions and the edges
are the transformations applied. In some instances, identical
structures can be obtained by different sequences of
transformations and the exploration of the design space is a
graph walking process. In any event, a path ends when no
alternative solutions worth exploring can be reached from a given
point. When all possible paths have been explored the end nodes
[...] chosen.
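
The graph walk can be sketched generically. The code below is an illustration and not the implemented heuristic routines: `transforms` and `acceptable` are hypothetical callables supplied by the caller, and the (time, cost) points are invented for the example.

```python
def explore(initial, transforms, acceptable):
    """Walk the design space and return the end nodes of all paths.

    transforms(design) yields the designs one transform removed.
    acceptable(parent, child) decides whether a child is worth exploring.
    """
    seen = {initial}          # identical structures are visited only once
    frontier = [initial]
    end_nodes = []
    while frontier:
        design = frontier.pop()
        children = [d for d in transforms(design) if acceptable(design, d)]
        if not children:      # a path ends when nothing is worth exploring
            end_nodes.append(design)
        for child in children:
            if child not in seen:
                seen.add(child)
                frontier.append(child)
    return end_nodes


# Hypothetical (time, cost) points; the numbers are illustrative only.
derived = {
    (10, 100): [(8, 105), (12, 90), (9, 120)],
    (8, 105): [(7, 107)],
}
m = 3  # dollars the designer will pay per time unit saved


def within_tradeoff(parent, child):
    (t1, c1), (t, c) = parent, child
    return t <= t1 and c <= c1 + m * (t1 - t)


result = explore((10, 100), lambda d: derived.get(d, []), within_tradeoff)
```

Keeping a set of visited designs is what turns the tree walk into a graph walk when different sequences of transformations lead to the same structure.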
8. A CONTROLLER FOR A CONVEYOR-BIN SYSTEM 
The following example is taken from [7]. Briefly, the 
algorithm performs the controlling function for a conveyor 
carrying items to be sorted into bins. 


The algorithm is described in ISP and its graph model is
shown in Fig. 12. Notice that in this example the nodes
correspond to sequences of one or more operations.


Figure 13. Design space exploration


Several alternative implementations can be derived from this
example. They are rather simplistic due to the compactness of
the algorithm, but they are nevertheless appropriate to show the
design space and its exploration, Fig. 13. First, assume that the
fastest solution is sought. All the applicable transformations deal
with the increment and testing of variables T and I (nodes 11, 12,
and 13 of the flowchart), and their concurrent execution with the
main computation (nodes 5,6,7,8,9, and 10).


The best solution (timewise) is given by point 4 in the 
design space. In this solution, the main body of the algorithm 
(5,6,7,8,9,10) is computed in parallel with the increment and 
testing subprocess (11,12,13) as a whole. Other alternative 
points are also shown in the diagram (points 1,2,3,5,6). Several 
things can be noticed in the design space diagrams; for instance, 
point 2, the parallel computation of (5,6,7,8,9,10), (11,12), and 
(13) is reached in two ways: First, (5,6,7,8,9,10,11,12) is 
performed in parallel with (13), point 1, and then the larger 
computation is performed as (5,6,7,8,9,10) in parallel with 
(11,12). The other way of reaching point 2 is by computing 
(5,6,7,8,9,10) in parallel with (11,12,13), point 4, and then 
transforming the smaller subgraph into (11,12) in parallel with 
(13). Notice, furthermore, that points 2 and 4 present the same
time value. The system uses the distance to the origin as a tie
breaker parameter.


The same example was also processed with the constraints
that 1) no more than 3 cost units (dollars) were to be added for
each time unit (1 microsecond) of speed-up, and 2) the time
should be no greater than the initial solution. With these
constraints, the system rejected point 5 for not having the proper
trade-off with respect to its predecessor. It is interesting to
see that point 3, which could be reached from 1 and 5 under the
unlimited cost constraint, can only be reached from 1 (since 5
was rejected, its successors were not obtained). A similar
situation is present at point 2, with respect to points 1 and 4.
The interesting detail is that 2, when reached from 1, is accepted
since the trade-off involved is below the threshold. When 2 is
reached from 4, it is rejected since the trade-off involved is
beyond the threshold.


9. CONCLUSIONS 

The purpose of this paper is to describe the development of 
an automated method for designing digital systems at the RT level. 
The designed system is optimized along a set of designer 
constraints. The primary result is a system that translates an 
initial behavioral description of a digital system into alternative 
structural specifications from which it can be built. For 
simplicity, the structural specifications are given in terms of a 
specific set of building blocks, the RTM set. 


Due to space limitations, it is impossible to provide in a 
paper of this nature any detailed description of the system as 
implemented, and therefore we have tried to point out in general 
terms what its capabilities are.


The system is a research tool and its implementation allows it 
to be used either as a closed system, in which the user only 
specifies an initial description and a set of constraints and goals, 
upon which the system performs an automatic design space
exploration; or, as an interactive facility, driven by a command
language that allows the user to exercise any function of the 
system from a time-sharing terminal. 


A system of this nature presents limitations as to the degree
of "optimization" it can perform. It is not expected to obtain
solutions that are radically different from the one specified by the
user. Hence, its use is more likely to be as part of a design
cycle, in which the user presents an initial description which is
processed by the system; the result of this is an exploration of
the design space around such initial solution; this exploration can
suggest to the user modifications to his behavioral specifications;
these modified specifications are then fed back into the system and
the process starts again.


SYMBOL HARDWARE COMPILER


ABSTRACT


One of the most outstanding features of the SYMBOL computer
is its high level hardware compiler. This paper
presents some aspects of the hardware implementation
including the network characteristics of the communication
scheme between compiler, system supervisor, and
Memory Controller, the functional breakdown into
distinct sections for implementation, the support hardware
(registers, tables, etc.), the Name Table
structure, and some of the linking techniques for the
structured output of the compiler.


I. INTRODUCTION


The main objectives and goals of the SYMBOL research 
project [1,2,4] were to demonstrate the reduction of 
the total costs of data processing by revising the 
designer's approach on the following key items: 


a. Hardware/Software boundaries 
b. System Architecture 
c. System Packaging 


The hardware compiler is one of the best examples for
demonstrating items (a) and (b) above because, first,
it was implemented totally in hardware, thus representing
a 100% departure from the classical approach of totally
software compilers and, second, the language used (the
"SYMBOL" language) [3] broke all barriers of traditional
restrictions for compatibility with existing languages.


The SYMBOL SYSTEM consists of the following eight
specialized processors which operate autonomously but are
linked together via the main data and communication bus:
the System Supervisor (SS), the Memory Controller (MC),
the Compiler (Translator) (TR), the Central Processor
(CP), the Channel Controller (CC), the Input Processor
(IP), the Disc Channel Controller (DC), and the Memory
Reclaimer (MR).


The Compiler takes as its input a program written in
the high level procedural "SYMBOL" language. The
program has been deposited in the Memory by the IP.
The Compiler then generates a reverse Polish object
string and a multi-level block structured name table
suitable for execution by the Central Processor. In
the process of doing this, the Compiler uses a small
table of Reserved Words (about 100) which are kept in
the non-pageable portion of main memory and a library
of call-by-name system procedures stored in the pageable
portion of main or bulk memory. The compiler manages
its own communications with the Memory Controller and
the System Supervisor. All of the above objectives are
accomplished totally in hardware.


II. COMPILER OVERVIEW 

Basically, the Compiler can be thought of as a network
in conjunction with the System Supervisor (SS) and the
Memory Controller (MC). See Figure 1. For reasons of
compatibility with previous SYMBOL references, the
compiler will hereafter be referred to as the
Translator (TR). Each mode of communication will be
discussed in detail later.


Figure 1. SS, TR, and MC Communication


The overall block diagram of the Translator and its
paths of communication with the SS and MC are shown in
Figure 2. As indicated there, the TR picks up its input
(source program) from some location in storage
called TWA (Transient Working Area) and deposits its
two structured outputs (Object Code and Program Name
Table) in other locations of storage. It also communicates
with the SS, which maintains the Terminal
Control Headers, the Task Assignment Queues, the Page
Out Queues, and handles the Error and Interrupt
analysis.


From the standpoint of hardware implementation, the TR 
is divided into three major sections as shown in 
Figure 2: the Object Code section, the Name Table
section, and the Support Hardware section. Only the 
Name Table and Support Hardware sections will be dealt 
with in this paper. From the functional standpoint, 
only one of the three sections can be active at a time. 
Either one of the two logic sections (Object or Name 
Table) can request action by the Support Hardware but 
once the Support section has been activated, the logic 
section that requested the action freezes until the end 
of the Support activity at which time it continues on. 
During compilation, the operation of the two logic 
sections is a Ping-Pong-like action. The Object 
section processes all non-literal single characters 
(delimiters) and structured alphanumeric data until it 
comes to a blank space followed by a letter; this could 
be either a Reserved Word or an identifier. At that 
point it turns control over to the Name Table section. 
The Name Table section resolves the name and gives 
control back to the Object section for processing. 
Thus, the control bounces back and forth between the 
two sections until the end of the source program. At 
that time the Name Table section takes over and performs
the resolution and linking of all identifiers
(Global Linking). 


III. SUPPORT HARDWARE
A. TR-SS COMMUNICATION 


TR-SS Communication takes place during Control Exchange 
Cycles (CEC). During a CEC, a certain allocation of 
bus lines is used for communicating information between 
the eight processors in the system. 


The process of compiling a job begins at the end of the
Load Mode administered by the IP. At that time the
Input Processor (IP) notifies the System Supervisor (SS),
during a CEC, that it has finished inputting a program.
The SS then puts that program (job) at the bottom of
the Translator queue and also initializes the Terminal
Header Control Words [5] with the appropriate pointers
to the beginning address of the source code and to the
beginning address of the object code which is to be
generated by the TR.


Figure 2.


When the job percolates up to the top of the TR queue, 
the SS initiates a Control Exchange Cycle and sends a 
start command to the Translator over the control bus 
along with the Terminal (user) number of the Terminal 
that input the job.


At this point, the Translator becomes activated, looks 
at the terminal number, and begins work on the job by 
first fetching the Terminal Header Control Words to 
find its pointers. The Translator is now on its own 
and, from this point on, it can be stopped only by the 
occurrence of one of the following conditions: SS 
Interrupt, Program Trap, Page Out, Program Error, and 
Completion of Task. 


In each one of these cases, the TR saves its status in 
the appropriate terminal header control words, 
initiates a CEC, and transmits a completion code to 
the SS during the CEC. The SS analyzes the code and 
takes the appropriate action. Specifically, in the 
case of Program Error, the TR saves enough information 
so that the SS under system control will print out at 
the user's terminal the type and location of the 
syntax error. In the case of a Page Out, both SS and
TR sense the Page Out from the Memory Controller (MC).
The TR starts its shutdown procedure; the SS performs
some housekeeping for the TR Page Out but does not
wait for the TR completion code. It goes on with
whatever other tasks it may have in its queue until a
Page Out completion is received from the TR. The page
is then put on the paging queue, the Terminal Header
Control Word (THCW) of the task is marked to indicate
waiting for a page, and the job is put on the bottom of
the TR queue. Another task is now assigned to the TR.
When the page has been brought in by the MC, the THCW 
is marked to indicate that the page is now in. The 
next time the task percolates up to the top of the TR 
queue again, the SS restarts the TR on that job. The 
TR, during its shutdown process, saves the following 
information: Name Register, Stack Register, Object 
Register, All Address Registers, Phase Counter Status, 
all pertinent flags, and source character pointer. 


Block Diagram of Translator


B. TR-MC COMMUNICATION 

The SYMBOL system features a Dynamic Memory Management 
capability via the Memory Controller which allocates 
memory space on demand, performs address arithmetic, 
and manages the associative memory needed for paging in 
its virtual memory environment. 


The Translator is one of the heaviest users of Memory. 
Besides its input-output interaction with memory (fetch 
source program, store object string, and name table), 
it performs many searching operations. For every 
English word that appears in the source program, the 
Reserved Word Table (RWT) has to be searched. If it is 
not found in the RWT, the current block of the Name 
Table has to be searched. At the end of the program 
the entire Name Table has to be searched again for 
Global Name resolution and linking, procedure call 
handling, and system name resolution. Thus, to avoid 
situations where the TR would tie up the memory during 
long searches, the TR was given a relatively low memory 
access priority (fourth priority out of five; after SS, 
IP, CP, but before MR). 
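
The fixed priority ordering can be illustrated with a short Python sketch. The order SS, IP, CP, TR, MR is taken from the text; the function name `grant` and the representation of pending requests as a set are assumptions made for illustration and do not describe the actual MC logic.

```python
# Memory-access priority as stated in the text: SS highest, MR lowest.
PRIORITY = ["SS", "IP", "CP", "TR", "MR"]


def grant(pending):
    """Return the unit served next, or None if nothing is pending.

    `pending` is a set of unit names with an outstanding memory request
    (a hypothetical representation chosen for this sketch).
    """
    for unit in PRIORITY:
        if unit in pending:
            return unit
    return None
```

With this ordering a long TR search can delay only the MR; any request from the SS, IP, or CP is served first.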


Figure 3 shows the TR-MC communication in a block 
diagram form. Typically, a phase of a particular Task 
Phase Counter requests a memory cycle by raising the 
MOP (0-3) lines and holds at that phase. The Memory 
Communication Interface Section takes over. It puts 
out TR's Memory request line (MP4). When the Memory is 
free and there is no higher priority request, it puts 
the command on the bus and unloads the registers. A 
few clocks later, the MC returns a completion code 
during a CEC. If the Memory operation was successful, 
the TR loads its Registers from the main bus during the 
clock time following the CEC and the advance signal 
(MDONE) is issued to allow the logic phase counter to 
move on to the next phase. 


If a page out completion was returned, the MDONE signal 
is not issued. Thus, the logic phase counter freezes 
and the shutdown routine takes over and saves the TR 
status, including the state of the phase counter. 
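
The handshake described above can be summarized in a small Python sketch. The completion code names `"ok"` and `"page_out"`, the integer phase numbers, and the function name are assumptions; only the behaviour (MDONE advances the phase counter, a page out freezes it and saves its state) comes from the text.

```python
def memory_cycle(phase, completion):
    """Model one TR memory request issued while holding at `phase`.

    `completion` is the code returned by the MC during the CEC.
    Returns (next_phase, mdone_issued, saved_phase).
    """
    if completion == "ok":
        # Registers are loaded from the main bus, MDONE is issued,
        # and the phase counter moves on to the next phase.
        return phase + 1, True, None
    if completion == "page_out":
        # MDONE is withheld: the phase counter freezes and the shutdown
        # routine saves the TR status, including the frozen phase.
        return phase, False, phase
    raise ValueError(f"unknown completion code: {completion}")
```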


During the period between cycle granting and 
completion code return from the MC, the main bus can 
be used by the CP on a cycle-stealing basis for intra-CP 
data communication between the various sub-units of 
the CP (Instruction Sequencer, Arithmetic Processor, 
Reference Processor, and Format Processor). 

Figure 3. 


C. TR REGISTERS 


The TR utilizes four Data Registers and eight Address 
Registers. See Figure 4. One of the basic functions 
of the TR is character processing (full word = 64 bits, 
one character = 8 bits). For this reason, the four 
Data Registers (Source, Name, Object, and Stack) have 
very flexible single character control as well as full 
word capability. Each Register has different capabil- 
ities depending upon the function it performs. The 
Source Register is used to fetch and hold the current 
word of the source program under compilation. It gets 
loaded in the full word parallel mode from the Main 
Data Bus, but it only outputs a single 8-bit character 
at a time for decoding and interpretation. The Name 
Register is a working register used for building and 
holding identifiers currently under consideration for 
or from the Name Table. It is the most versatile 
Register because it has both single character and full 
word capability in both its input and output. The 
Object Register is used to hold the Object code 
generated by the TR and store it in Memory. It needs 
only single character input capability but full word 
output capability. The Stack Register, used for main- 
taining and manipulating the stack, also has both 
character and word capability for input and output. 


Figure 4. TR Registers 


Typically, each Data Register is associated with a 
three-bit counter and a three-bit register to achieve 
character control. The three-bit register is referred 
to as the pointer. It gets loaded in parallel and it 
points to one of the eight characters in the Data 
Register for reference purposes. The three-bit counter 
is an up-down counter with parallel loading capability. 
It usually gets loaded in parallel from the pointer 
register. Thereafter, it responds to count-up or 
count-down (forward/backward) commands. The eight 
decoded states of the counter combined with the Read/ 
Write command provide the selection signals for 
character selection in the Data Registers. 
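
As an illustration of the pointer and counter mechanism, the following Python sketch models one 64-bit Data Register. The class and method names are hypothetical, and two details are assumptions not stated in the text: character 0 is taken to be the most significant byte, and the counter wraps modulo 8.

```python
class DataRegister:
    """One 64-bit TR data register with its 3-bit pointer and counter."""

    def __init__(self, word=0):
        self.word = word & (2**64 - 1)
        self.pointer = 0   # three-bit reference pointer
        self.counter = 0   # three-bit up-down counter

    def load_pointer(self, value):
        self.pointer = value & 0b111

    def load_counter_from_pointer(self):
        self.counter = self.pointer

    def count(self, forward=True):
        # Assumed to wrap modulo 8, as a three-bit counter would.
        self.counter = (self.counter + (1 if forward else -1)) & 0b111

    def _shift(self):
        # Assumption: character 0 is the most significant byte.
        return 8 * (7 - self.counter)

    def read_char(self):
        return (self.word >> self._shift()) & 0xFF

    def write_char(self, char):
        mask = 0xFF << self._shift()
        self.word = (self.word & ~mask) | ((char & 0xFF) << self._shift())
```

The decoded counter value plays the role of the character selection signal; the Read/Write command corresponds to choosing `read_char` or `write_char`.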


The eight Address Registers are named Address Register 
1 through Address Register 8 (AREG1-AREG8). Each AREG 
consists of 24 bits. All eight registers communicate 
with the Memory. However, AREG1-AREG4 also communicate 
with the left half of the Data Registers (characters 2, 
3, 4) and AREG5-AREG8 also communicate with the right 
half of the Data Registers. 


IV. NAME TABLE SECTION 


Most compiler systems do not use a separate name table. 
Address references to data space are contained in the 
program string. 


One of the most distinguishing features of the SYMBOL 
compiler is the use of a separate Name Table during 
execution. In this way, the program string contains 
only references to the Name Table entry which, in turn, 
contains all the pertinent information and pointers, 
for the NAME. Any future change in the parameters will 
affect only the Name Table entry. 


A. NAME TABLE CONSTRUCTION 


Control is given to the Name Table logic by the Object 
logic section with the source register pointer pointing 
to the first character of the potential identifier. 

The Name Table logic starts searching the Reserved Word 
Table (RWT). If a match occurs, it puts the code on 
the bus and turns control back to the Object Section 
for processing the Code. If there is no match in the 
RWT, it determines the boundaries of the current word 
by searching and locating the next delimiter in the 
source string. Now, having the exact size of the 
identifier, it starts searching the current Name Table 
block. If a match occurs there, it puts the address of 
its Control Word on the appropriate Address Register 
and gives Control to the Object Section for processing. 
If no match occurs in the current block, the identifier 
is considered as local (by default) and it gets 
inserted at the bottom of the Name Table. Its Control 
Word is created in the next assigned memory location, 
and the address of the Control Word is placed in the 
appropriate Address Register. Control is now given 
back to the Object Section. 

Figure 5. Name Table Construction Flow Diagram 

Figure 5 shows the overall flow diagram of Name Table 
construction. The Name Table consists of one or more 
blocks that can be nested as shown in Figure 6. There 
is no hardware limit to the degree of nesting even 
though Global declarations carry identifiers up only 
one block level. 

Figure 6. Block Structure of Name Table 
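
The construction sequence of Section A (Reserved Word Table first, then the current block, then insertion as a local name) can be sketched in Python as follows. The function name, the dictionary standing in for the RWT, and the use of a list index in place of a Control Word address are all assumptions made for illustration.

```python
def lookup(word, reserved_words, current_block):
    """Classify one source word.

    reserved_words: dict mapping reserved word -> code (stands in for the RWT)
    current_block:  list of identifiers in the current Name Table block; an
                    identifier's index stands in for its Control Word address
    Returns ("reserved", code) or ("identifier", index).
    """
    if word in reserved_words:
        return ("reserved", reserved_words[word])
    if word in current_block:
        return ("identifier", current_block.index(word))
    # No match in the current block: the identifier is local by default
    # and is inserted at the bottom of the Name Table.
    current_block.append(word)
    return ("identifier", len(current_block) - 1)
```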


B. BLOCK ORGANIZATION 


The basic scheme of block organization is shown in 
Figure 7. There is a Block Start Control Word at the 
beginning of each block that contains linking and 
status information concerning the whole block. The 
body of the Block consists of VFL identifiers followed 
by their Control Words. The Control Word of the last 
identifier is properly marked to signify the end of 
the block. Figure 8 shows the block linking 
for the block structure of Figure 6. Thus, the 
Forward Link threads all blocks in the program, 
starting with the outermost block. This link is 
followed during the Global Linking phase in order to go 
through every block in the program and make sure that 
all identifiers are resolved as either being local to 
the block or global to the enclosing block or to the 
system (as in procedure calls, etc.). The Back Link 
is followed again during Global Linking to search 
enclosing blocks. From the outermost block there is 
an automatic exit to the System Name Table if there 
is a possibility for the identifier to be a System 
procedure. 


The basic search mechanism, from the hardware 
standpoint, uses two data registers and two address 
registers. One data register holds the name under 
consideration and the other holds the current name of 
the block being searched. A character-by-character 
comparison is administered until either a mismatch 
occurs or the Control Word of only one identifier is 
reached; in either case the comparison has failed. 
If the Control Words of both identifiers are found 
simultaneously, then the comparison is successful and 
the appropriate linking occurs. The Address Registers 
are used in conjunction with a memory command (fetch 
and follow, follow and fetch, Store and Assign, etc.) 
when crossing a word boundary to fetch the next word 
or to store the Control Word back in memory after 
linking. 


Figure 7. Name Table Block Organization 


Figure 8. Block Linking Method for the Block Structure of Figure 6. 


V. RESERVED WORD TABLE 


The Reserved Word Table is a list of the words used in 
the internal character set as part of the SYMBOL lan- 
guage syntax. The table is stored in an area of the 
memory which is non-pageable but enjoys the automatic 
incrementing and link following capabilities of the 
MC in order to facilitate searching. The list is 
arranged alphabetically. Each Reserved Word occupies 
as many Memory words as needed. 


The code for each RW is stored in the last character 
of the last memory word occupied by the RW. Thus, in 
the case of ABSOLUTE, as shown in Figure 9, the code 
99 had to go in the next Memory Word. The address of 
the first word in the list of the RWT for each letter 
of the alphabet is kept in a link table that occupies 
the first four words of the first group of the page 
that holds the RWT. The table, as shown in Figure 9, 
is arranged so that the address of the link for each 
letter is directly related to the code for that letter. 
Thus, a portion of the code of the word's initial 
letter is used directly as the address to fetch the 
link of the first word in the RWT. 


Figure 9. Reserved Word Table Organization and Linking 


The code for letter A, for example, is /41 = 01000001. 
Thus, by using the last three bits (001) we can 
address directly the link for letter A which is stored 
at character 1 of word 0. This link will now point us 
to the address of the first word of the RWT that 
begins with A. Now we begin comparing the source 
program word with the RWT. If a mismatch occurs or if we 
reach the RW code before the end of the source word, 
we move on to the next RW. If the first character in 
the next word fails to match, then we have exceeded 
the list for the particular letter. Therefore, the 
source word under consideration is not a Reserved 
Word. 
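
The link-table addressing and the list scan can be sketched in Python. The text states only that the low three bits of the letter code select the character (as in the example for A) and that the link table occupies four words; using the next two bits to select the word is an assumption, as are the function names and the dictionary that stands in for following a link into the RWT. Letter codes are assumed to be ASCII-compatible, as the /41 example suggests.

```python
def link_slot(letter):
    """Return (word, character) of the link-table slot for an initial letter.

    The low three bits give the character position; taking bits 3-4 as the
    word number within the four-word link table is an assumption.
    """
    code = ord(letter)              # e.g. "A" -> 0x41 = 0b01000001
    return (code >> 3) & 0b11, code & 0b111


def find_reserved(word, lists_by_slot):
    """Scan the alphabetical list for the word's initial letter.

    lists_by_slot maps a link slot to a list of (reserved_word, code)
    pairs, a stand-in for the linked memory words of the RWT.
    Returns the code, or None if the word is not a Reserved Word.
    """
    for reserved, code in lists_by_slot.get(link_slot(word[0]), []):
        if reserved == word:
            return code
    return None   # list for this letter exhausted
```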


VI. SYSTEM LIBRARY 

The system library consists of two parts: the System 
Name Table, which serves as an index to the system 
programs, and the system programs themselves. 


A system name (System Procedure) is the name of a 
program stored in memory as part of the system library 
that contains frequently used programs and service 
programs. There are two types of system library 
programs: 


A. Restricted System Programs (RSP) 


A restricted system program can only be called 
(used) by privileged users. A privileged user is 
either a privileged terminal or a privileged 
system program. 


B. Nonrestricted System Programs (NSP) 


Nonrestricted system programs consist of two 
types: Privileged System Programs (PSP) and 
Common System Programs (CSP). 


The System Name Table consists of program names 
(identifiers) in the VFL form followed by one control 
word. The control word holds the address that points 
to the system program somewhere in core and also dis- 
plays information about the type of system program 
(RSP, PSP, CSP) and the status of the Name Table at 
that point (Table Start, Table End). 


There are two different Name Tables, one for the RSP 
and one for the NSP. The two main reasons for the two 
different tables are: flexibility of library manipu- 
lation and speed-up of search. 


The address of the beginning of these tables is held 
in the Header area of the terminals (CH2). Thus, 
possibly, although not necessarily, each terminal could 
have its own library or have no access to system 
library at all or a group of terminals could have the 
same library. This type of arrangement is primarily 
aimed at keeping the system library, or parts of it, 
out of the reach of unskilled users or users who have 
no need for it. It is not intended that each terminal 
have its own library because there will be a fair 
number of system programs that will be needed by many 
terminals. Repetition of these program names in every 
terminal's library would use up too much space in memory. 


Figure 10. System Library Organization 


Referring to Figure 10 which shows the system library 
structure, the following observations can be made: 


nonrestricted system programs (NSP) may be called 
by any user (terminal or a program); 


restricted system programs (RSP) may be called 
only by privileged users; 


privileged users are a privileged system terminal 
or a privileged system program; 


an RSP or a PSP can call any other RSP or NSP; 


a CSP can only call a PSP or another CSP. 
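
These calling rules can be written as a small Python predicate. The caller labels and the function name are hypothetical, and treating an RSP as a privileged caller is inferred from the statement that an RSP or a PSP can call any other RSP or NSP.

```python
# Callers treated as privileged (inferred from the observations above).
PRIVILEGED = {"privileged_terminal", "RSP", "PSP"}
# Nonrestricted system programs: privileged and common.
NSP = {"PSP", "CSP"}


def may_call(caller, callee):
    """Return True if `caller` may call a system program of type `callee`.

    caller: "privileged_terminal", "terminal", "RSP", "PSP", or "CSP"
    callee: "RSP", "PSP", or "CSP"
    """
    if callee in NSP:
        return True                    # NSP: callable by any user
    if callee == "RSP":
        return caller in PRIVILEGED    # RSP: privileged users only
    raise ValueError(f"unknown program type: {callee}")
```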


VII. CONCLUSION 

Even though a microprogram based hardware compiler 
would have given the system greater flexibility, the 
present compiler has proven that a hardware compiler 
was not only possible but also reasonably successful 
even with the technologies of the late 1960s. The 
software empire which grew so big so quickly in the 
last decade, was, for the first time, seriously 
assaulted by the SYMBOL compiler. 


SYMBOL was meant to be an experimental machine. There 
are many approaches that a designer can take in 
implementing a hardware compiler. The SYMBOL compiler 
represents only one approach which in totality may or 
may not be the ultimate in efficiency. 
However, some of the algorithms developed and proven 
will continue to form the guidelines for some time to 
come. 


ABSTRACT 


This paper presents the architecture of a context- 
addressed cellular system for non-numeric information 
processing, using an inexpensive, large-capacity 
circulating memory device. The system allows data to 
be represented in a structure very close to the form in 
which the user perceives it (information structure) and allows 
the search operations of high level queries to be 
implemented directly. The information structures 
currently used in existing information systems are 
described. Then the architecture of the system as a 
whole is presented, as well as the implementation of 
these information structures as basic data types and 
hardware management of storage allocation and garbage 
collection. 


The paper intends to demonstrate that distributing 
intelligence throughout a rotating memory device can 
decrease the time required for search operations in 
large data bases, and that the search strategy and 
storage management functions can be efficiently carried 
out in hardware, greatly simplifying the software of 
information systems. Thus, data becomes not only faster 
but also easier to access, verify, insert and delete. 


1. INTRODUCTION 


Most existing information systems are implemented 
on general purpose von Neumann type computers. Von 
Neumann processors have serious inherent limitations 
when they are applied in non-numeric information pro- 
cessing. We shall discuss some of the limitations with 
respect to both retrieval language and storage of data 
in information systems. 


Due to the serial nature of von Neumann processors, 
the time required to access a data item if nothing is 
known concerning its physical location varies linearly 
with the size of the data base. In order to minimize 
this access time, the information structure (structure 
of data as the user perceives it) is transformed into 
a data structure designed for efficient access on von 
Neumann processors. The data structure is then mapped 
into a machine dependent storage structure. These 
three levels of data representation are found to be 
essential in the design of a file system using von 
Neumann processors (Wang and Lum 1971). The inter- 
mediate access path level introduces complexities in 
both the storage of data and the retrieval language. 
Since the information structure and data structures 
are usually quite different, the storage structure 
does not closely resemble the format of data as the 
user perceives it. Also, complicated procedural steps 
are introduced which are basic operations of von Neumann 
processors rather than basic operations of the high 
level language of the user. These additional structures 
and procedural steps are both greatly complicated by 
the need to schedule paging of large amounts of 
information between discs (where the data base must be 
stored) and the primary memory of the processor. It 
is now widely accepted that provision for isolating 
the user from the data and storage structure levels is 
one of the major objectives of a data base system 
(CODASYL report 1971a-b, Engles 1970, and Guide/Share 
report 1971). However, this has proven to be a very 
difficult and expensive task in software. This paper 
presents a hardware solution to these basic problems. 
The following paragraphs indicate the approach we have 
taken. 


If all operations on the data base are done directly 
in (fixed head) disc memory where the entire data base 
is stored, then the excessive paging is eliminated. 
Also, parallelism is used to make the time to search 
the data base independent of the data base size. This 
eliminates the need for an intermediate access path 
level, because the entire data base is searched by 
hardware for each search instruction. Thus, the 
parallelism inherent in high level retrieval languages 
can be implemented without the need to translate the 
specification of what is desired by the user into 
complicated procedural steps. Data can be stored in a 
format which is very close to the user's information 
structure, removing the data representation at the 
access path level from the task of data definition by 
the user. 


The idea of using distributed intelligence in 
inexpensive, large capacity, circulating memory devices 
has evolved slowly. Partially associative devices have 
been suggested (Hollander 1956, Parker 1971, Minsky 
1972). They allow name-value pairs as the basic data 
type and allow only the name part to be searched. 
Content associative devices (Fuller et al. 1965, 
Parhami 1972) allow the value part to be searched also. 
String, substring and template searches have been 
examined (Healy et al. 1972) on a context addressed 
disc. The context-addressed, segment-sequential memory 
(CASSM) described here offers several advantages over 
these devices. It allows widely used information 
structures (such as trees, sets, graphs and relational 
tables) to be implemented as basic data types of the 
machine. Also, the task of storage allocation and 
garbage collection is taken over by hardware. A more 
detailed discussion of software advantages and 
considerations is given in Su et al. (1973). 


In section 2 we describe the various information 
structures widely used in non-numeric processing. In 
section 3, the hardware of the CASSM which implements 
these high level data types and automatic storage 
allocation and garbage collection is presented. A 
summary and conclusions are given at the end. 


2. INFORMATION STRUCTURES OF NON-NUMERIC PROCESSING 


In non-numeric processing, several information 
structures have become useful in representing 
information. They are the directed graph or network model 
(CODASYL 1971a), the relational model (Codd 1970, 1971) 
and the tree or hierarchical structure, which is 
commonly used in data processing systems. Information 
is represented in each of these structures in general 
as a set (record) of attribute-value pairs in each node 
or table entry. These are called information structures 
because the user views his data as being displayed most 
naturally in these structures, and because operations of 
his data involve specifying parameters that are also 
parameters of the structures. 


In order to see more concretely which operations 
are performed on these information structures, let us 
examine an example inventory file taken from J.C. Date 
(1972) in his tutorial description of Codd's work on 
relational files. The logical relations among the 
data fields in the file can be represented by a tree 
structure as well as by a network structure. They can 
also be represented by E.F. Codd's (1970) normalized 
relational form involving three relational tables as 
in Figure 1. The tables show a many-to-many mapping 
of suppliers and parts (each supplier supplies many 
parts and each part is supplied by many suppliers). 
Each table defines a relation with the domains of the 
relation shown as the headings of the columns. Each 
row contains a set of attribute (name)-value pairs, 
where the attributes are listed as domain headings. In 
the example, Date shows four possible queries to be 
satisfied: 


(a) find part numbers for parts supplied by supplier 2; 
(b) find part names for parts supplied by supplier 2; 
(c) find supplier numbers and status for suppliers in 
    London; 
(d) for each part find part number and names of all 
    cities from which the part may be obtained. 


FIGURE 1. INVENTORY FILE IN CODD'S NORMALIZED RELATIONAL FORM 


    INVENTORY FILE 
    S (S#, SNAME, STATUS, CITY) 
    P (P#, PNAME, COLOR, WEIGHT) 
    SP (S#, P#, QTY) 

FIGURE 2. SKELETAL DESCRIPTION OF FILE IN CODD'S NORMALIZED RELATIONAL FORM 


In Figure 1, the supplier-part (SP) table shows 
how the many-to-many mapping is handled in the 
relational model using redundant data values to link 
tables by contents rather than by addresses. If the 
user is supplied with the skeletal description of the 
table arrangement as in Figure 2, he may specify each 
of the above queries in a non-procedural, calculus 
type statement similar to that developed by Codd (1971) 
and given in Date: 


(a) SP.P# : SP.S# = 2 
(b) P.PNAME : ∃SP((P.P# = SP.P#) ∧ (SP.S# = 2)) 
(c) S.S#, S.STATUS : S.CITY = 'LONDON' 
(d) P.P#, S.CITY : ∃SP((P.P# = SP.P#) ∧ 
    (S.S# = SP.S#)), ∀P.P#. 


The data items in the above statements are 
specified by qualified names as those used in COBOL and 
PL/1. The expression on the left of the colon specifies 
what is to be retrieved and the expression on the right 
is a qualification. For example, the first statement 
is a query for retrieving all part numbers (P#) 
supplied by supplier number 2 (S#=2). 


This same inventory file can be put into a 
hierarchical form in which suppliers are superior to parts 
as in Figure 3. If the user is given the skeletal 
description of the file as in Figure 4, he may express 
each of the queries in a non-procedural, calculus type 
statement as follows: 


(a) S.P.P# : S.S# = 2 
(b) S.P.PNAME : S.S# = 2 
(c) S.(S#, STATUS) : S.CITY = 'LONDON' 
(d) S.(P.P#, CITY) ∀S.P.P# 


FIGURE 3. INVENTORY FILE IN A HIERARCHICAL FORM 


    INVENTORY FILE 
    S(S# SNAME STATUS CITY) 
    P(P# PNAME COLOR WEIGHT QTY) 

FIGURE 4. SKELETAL DESCRIPTION OF FILE IN A HIERARCHICAL FORM 


The qualification on the right of the colon is 
simplified by the fact that specific hierarchical 
dependencies are given in the specification on the left 
using qualified names and parentheses. 


Statements like the queries above can be either 
implemented as basic operations of CASSM or easily 
broken down into basic operations. The details of 
how this is done is included in section 3.2 and 3.3. 


3. GENERAL DESCRIPTION OF HARDWARE 


In order to fully exploit LSI technology, CASSM 
consists of a chain of identical cells. Each cell can 
communicate directly with its two neighboring cells 
and with an IO bus common to all cells. Each cell 
consists of two parts: a circular, sequential segment 
of memory (such as, a disc track, a circular charge- 
coupled device, or a magnetic bubble device) and a 
logic section. All segments of memory circulate 
concurrently and in synchronization, while each logic 
section reads, searches, modifies and rewrites its 
segment of memory from one end to the other. Thus, 
all segments of memory are operated on in one cir- 
culation of memory. A read and a write head per track 
(segment) is required for implementation on a set of 
discs. The conceptual arrangement of the hardware in 
each cell is illustrated in Figure 5. The remaining 
sections of this paper describe the function and 
implementation of the submodules of this figure. 

FIGURE 5. 


The regularity involved in having identical cell 
logic and all memory segments equal in length is 
desirable for cost effective hardware implementation. 
However, information structures used in non-numeric 
processing are highly variable in length. To require 
the user to partition his structures to fit into these 
equal length memory segments would lead to inefficient 
utilization of memory and to greatly increased software 
costs. In CASSM, variable length structures are divided 
into equal length segments for high utilization of 
memory as in Figure 6. Each segment may contain only a 
part of a record, a whole record, or several records 
of a file. Submodules within each cell of Figure 5 
allow operations on variable length structures that 
overlap any number of segments. The forward and back- 
ward marking facility provided by a one bit random 
access memory (RAM) will be described in the following 
section. 


FIGURE 6. STORAGE OF RECORDS AS SEGMENTS (panels: SOFTWARE MAKEUP, HARDWARE PLACEMENT) 


3.1 Forward and Backward Marking 


In section 2, we pointed out that high level (non- 
procedural) statements have two distinct parts: a 
specification (S) of what is to be marked and a 
qualification (Q) that must be met for the marking to 
take place. Each query involves searching and con- 
ditionally marking all occurrences of S-Q pairs through- 
out the data base. If the occurrences of the pairs 
overlap one another, i.e., if any element of a pair 
occurs in between the elements of another pair, 
then the search would take more than one disc 
revolution. If the data base is stored such that for 
each query, each occurrence of an S-Q pair referred to 
by the query does not overlap any other pair in the 
sequence, then the pairs can be operated on one at a 
time as memory sweeps by. We need only enough hardware 
to operate on one pair at a time. Using two comparators 
within each cell (one for S and one for Q, as in 
Figure 5) allows S and Q to be searched for during the 
same sweep of memory. However, these two parts may be 
separated by an arbitrarily long distance with much 
data in between. If Q occurs before S in the sequence 
(forward marking), this does not present any problem to 
implement. The one information bit regarding the 
success of the Q search can be saved until S is found 
and conditionally marked later in the sequence. But 
if S occurs first, then it cannot be marked until Q is 
found and satisfied later in the sequence. We need 
some way to access a mark bit of S after Q is searched 
(backward marking). 


This can be accomplished by having a set of mark 
bits that can be accessed independently of the position 
of the circular memory segment and with a simple hard- 
ware method of mapping the mark bits to data items on 
the segment. Figure 7 shows how a small one bit wide 
RAM within each cell is used to do this. A counter 
initially set to zero at the beginning of each segment 
revolution is used as a hardware pointer to the one 
bit RAM. The beginning of each data item indicates 
that the counter is to be incremented (using a special 
delimiter bit or symbol) so that the counter points to 
a unique marker bit for each data item in a segment. 
We have a 1-1, onto mapping of marker bits to data 
items. Although a RAM is being used, data items are 
not tied down to a physical location and items may be 
of variable length. Only their relative positions in 
the sequence are important. 
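
The mapping of mark bits to data items, and its use for forward and backward marking, can be illustrated with a Python sketch of one revolution over a single segment. The item representation (a `(kind, value)` pair with kind `"S"` or `"Q"`), the variable names, and the explicit `backward` flag are assumptions made for the sketch; the counter here starts at -1 only so that the first item maps to RAM address 0.

```python
def sweep(items, q_ok, backward):
    """One revolution over a segment; returns the one-bit RAM contents.

    items:    list of (kind, value) data items; kind "S" is a specification
              field, kind "Q" a qualification field, anything else is filler
    q_ok:     predicate applied to the value of each Q item
    backward: True if S precedes Q in each pair, False if Q precedes S
    Pairs are assumed not to overlap one another.
    """
    ram = [0] * len(items)   # one mark bit per data item
    counter = -1             # incremented at the start of each data item
    q_bit = False            # forward marking: result of the last Q search
    saved_addr = None        # backward marking: RAM address of the last S
    for kind, value in items:
        counter += 1
        if kind == "Q":
            ok = q_ok(value)
            if backward:
                if ok and saved_addr is not None:
                    ram[saved_addr] = 1   # mark the S seen earlier
                saved_addr = None
            else:
                q_bit = ok
        elif kind == "S":
            if backward:
                saved_addr = counter
            else:
                if q_bit:
                    ram[counter] = 1
                q_bit = False
    return ram
```

Because the counter advances once per item regardless of item length, the mark bits follow relative position in the sequence and not physical location.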

FIGURE 7. HARDWARE FOR MAPPING DATA ITEMS TO MARK BITS 


3.2 Implementation of Information Structures 


In this section we shall describe the implemen- 
tation of the various information structures widely 
used. The organization of data and the search 
operations in the disc system will be described. 


3.2.1 Trees or Hierarchical Structures and Sets 


Information is represented in a tree or hier- 
archical structure as a set, record or tuple of 
attribute (name)-value pairs in each node of the tree. 
A set is linearized by simply listing each set member 
sequentially. A set member can then be accessed by its 
attribute or position in the sequence and by its value. 
To greatly improve storage efficiency, sets are of two 
types, attribute sets and value sets. An attribute set 
can be placed in front of each value set or in front 
of a large number of value sets which have the same 
set of attributes. Thus storage efficiency is improved 
by not repeating identical attribute sets. A sequence 
position counter CSP, incremented by data item de- 
limiters as in Figure 7 but reset by beginning set de- 
limiters, indicates which set sequence position is 
currently being examined. If set members are accessed 
by their attribute, then two segment revolutions are 
needed. During revolution 1, the specified attribute 
is searched for in the attribute sets and marked when- 
ever found. During revolution 2, the sequence position 
of the marked attribute (provided by CSP) is saved in 
a register RSP and compared to CSP during examination 
of the subsequent value sets. Value sets need not be 
stored in the same cell as their attribute set. Some 
value sets may be separated by several cells from their 
attributes. The following procedure allows hardware 
communication of the sequence position number from one 
cell to the following cells. 


During revolution 1 above, whenever an attribute 
is found to match, the sequence position in CSP is 
stored in RSP. Between revolutions, the contents of 
RSP in a cell containing an attribute match is pro- 
pagated to the following cells. A one bit register 
R1 is used to cut off the propagation of the sequence 
position at the cell which contains the last item in 
the subtree that is presently being searched. R1=1 
indicates that the cell contains either a value set 
which has a level number less than the level number 
presently being searched or an attribute set which has 
a level number equal to the one presently being 
searched. Between revolutions, the contents of RSP 
in each cell is sent to cell i+1 and stored in the RSP 
of cell i+1. If R1=0 in cell i+1 (indicating that 
the cell is not at the end of the subtree), then the 
contents of RSP of cell i+1 is also sent to RSP of 
cell i+2. The procedure is repeated until the sequence 
position of the specified attribute reaches the end of 
its range over segments. Thus the correct sequence 
position number is prestored in the RSP of each cell 
before revolution 2. Revolution 2 is then executed 
exactly as described above. One bit of communication 
between adjacent cells is required. 
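
A rough Python sketch of the propagation between revolutions follows. It is only one reading of the procedure: the list representation of the cell chain, the use of `None` for a cell without a match, and the rule that a cell's own match is not overwritten by an incoming value are assumptions.

```python
def propagate(rsp, r1):
    """Propagate sequence positions down the cell chain between revolutions.

    rsp: per-cell RSP contents after revolution 1 (None where no attribute
         matched in that cell)
    r1:  per-cell cutoff bits; r1[i] == 1 stops cell i from passing on a
         value it received from its predecessor
    Returns the RSP contents of each cell before revolution 2.
    """
    out = list(rsp)
    for i in range(len(out) - 1):
        if out[i] is None:
            continue
        originated = rsp[i] is not None     # this cell held the match
        if originated or r1[i] == 0:
            if rsp[i + 1] is None:          # assumed: own match is kept
                out[i + 1] = out[i]
    return out
```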


A tree can be linearized in several ways (Knuth 
Vol. 1, 1969). If the tree is written in preorder with 
level numbers included with each node, then it is 
uniquely specified. The tree level number becomes part 
of the addressing specification. Also, for a given 
node at level ℓ, its entire subtree is listed before 
the next occurrence of another node at level ℓ. Thus 
a node at level k and any member of its subtree at 
level ℓ (k<ℓ) form an S-Q pair that does not overlap 
with any other pair at levels k and ℓ. This provides 
a convenient method for marking forward (down the tree) 
or backward (up the tree). For example, all ancestor 
nodes at level k can be marked if a search within one 
of their successors at level ℓ (k<ℓ) is successful. This 
can be done using the backward marking facility of the 
one bit RAM in the following way. We consider first 
the operation within one segment of memory. 


The ancestor is encountered first because of the 
preorder in which the tree is stored. The RAM address 
of the ancestor mark bit is saved for reference until 
the successor is searched later in the sequence. If 
the search is successful, then the ancestor mark bit is 
set using the RAM address that was saved. With the tree 
stored in preorder together with level numbers, the 
above algorithm simply involves remembering the RAM 
address of the last node at level k. Marking forward 
(down the tree) is much simpler. If a descendant is 
to be marked whenever an ancestor is successfully 
searched, the only thing to be remembered is whether 
or not the search was successful. The same hardware 
communication can be used between set members within the 
same tree node since any two members of the same set 
form an S-Q pair which does not overlap any other such 
pair in another set. Three one bit registers (RQ, R and 
RS) and two 10 to 12 bit registers (RB and RF) compose 
the basic hardware needed in the tree/set submodule of 
Figure 5. The functions of these registers are des- 
cribed below. 


For forward marking RQ is used to save the in- 
formation regarding the success of the Q search. For 
backward marking, RB is used to save the RAM address of 
the most recently encountered occurrence of an S. For 
both forward and backward marking, logic is needed to 
load either the counter of Figure 7 (forward) or RB 
(backward) into the RAM address register, set the bit, 
and initialize the registers for the next pair. 
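

Marking within one segment can be illustrated with a short software model. It is a sketch under stated assumptions, not the tree/set submodule itself: the tree is a hypothetical preorder list of (level number, value) items, the list `marks` stands in for the one-bit RAM, and the function names and the `q_test` predicate are introduced here for illustration. The variables `rb` and `rq` play the roles of RB and RQ.

```python
from typing import Callable, List, Optional, Tuple

Item = Tuple[int, str]  # (level number, value); items are listed in preorder


def mark_backward(items: List[Item], k: int, ell: int,
                  q_test: Callable[[str], bool]) -> List[bool]:
    """Mark the level-k ancestor (S) of any level-ell item (Q) passing q_test."""
    marks = [False] * len(items)   # stands in for the one-bit RAM
    rb: Optional[int] = None       # RB: RAM address of the most recent S
    for addr, (level, value) in enumerate(items):
        if level == k:
            rb = addr              # remember the last node seen at level k
        elif level == ell and rb is not None and q_test(value):
            marks[rb] = True       # set the ancestor's bit via the saved address
    return marks


def mark_forward(items: List[Item], k: int, ell: int,
                 q_test: Callable[[str], bool]) -> List[bool]:
    """Mark every level-ell descendant (S) of a level-k item (Q) passing q_test."""
    marks = [False] * len(items)
    rq = False                     # RQ: did the most recent Q search succeed?
    for addr, (level, value) in enumerate(items):
        if level == k:
            rq = q_test(value)     # re-initialised for each new pair
        elif level == ell and rq:
            marks[addr] = True
    return marks


tree = [(1, "root"), (2, "A"), (3, "x"), (3, "y"), (2, "B"), (3, "z")]
print(mark_backward(tree, 2, 3, lambda v: v == "y"))
# [False, True, False, False, False, False]
print(mark_forward(tree, 2, 3, lambda v: v == "B"))
# [False, False, False, False, False, True]
```

Backward marking needs a saved address because the item to be marked has already passed the read head; forward marking needs only one remembered bit.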


We now consider the operation on multiple segments. 
Although the above hardware is sufficient for processing 
sequentially encountered S-Q pairs residing on the same 
track, the elements of some pairs may be located on 
different tracks. In order to process these pairs in 
one revolution of memory, additional hardware is 
necessary. 


The following is needed for forward marking across 
tracks. RS is used to indicate at the end of a cir- 
culation of memory whether a Q (first pair member) has 
been satisfied on a segment without encountering an S 
after it on the same track. Thus RS indicates that 
the S occurs on one of the following tracks. RF is 
used to save the RAM address of the S if found in a 
following track for marking at the end of the circu- 
lation of memory. This is necessary because we cannot 
be sure that the Q on the previous segments has been 
satisfied until all memory has been searched. R is used 
to indicate whether at least one occurrence of S has been 
found in a segment. Also, one bit of communication is 
wired between adjacent cells. The hardware procedure 
is as follows. If RS=1 in cell i at the end of a 
circulation of memory, a pulse is sent to cell i+1. If 
R=1 in cell i+1, RF in that cell is used to set the mark 
bit of S. If R=0, the pulse is sent on to cell i+2. 
This is repeated until the cell is reached that has R=1. 
The marking is then done using the RF of that cell. 
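

The pulse procedure for forward marking across tracks can be modelled as below. This is a minimal sketch, assuming one S-Q pair is outstanding per pulse; the `TrackCell` class, its `marked` set (standing in for the mark bits of the cell's RAM) and the function name are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class TrackCell:
    rs: bool = False          # RS: a Q was satisfied here with no S after it
    r: bool = False           # R: at least one S was found in this segment
    rf: Optional[int] = None  # RF: RAM address of the S found in this segment
    marked: Set[int] = field(default_factory=set)  # RAM addresses whose bit is set


def forward_mark_across_tracks(cells: List[TrackCell]) -> None:
    """At the end of a circulation, send a pulse from each cell with RS = 1."""
    for i, cell in enumerate(cells):
        if not cell.rs:
            continue
        for follower in cells[i + 1:]:       # the pulse visits cells i+1, i+2, ...
            if follower.r and follower.rf is not None:
                follower.marked.add(follower.rf)  # mark S using that cell's RF
                break                        # the pulse stops at the first R = 1


cells = [TrackCell(rs=True), TrackCell(),
         TrackCell(r=True, rf=7), TrackCell(r=True, rf=2)]
forward_mark_across_tracks(cells)
print([sorted(c.marked) for c in cells])  # [[], [], [7], []]
```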


Much of the same hardware may be used for backward 
marking across tracks, since only one mode (forward or 
backward) is involved in each instruction. RS is used 
to indicate whether the elements of an S-Q pair reside 
on different tracks. R is used to indicate whether at 
least one occurrence of Q has been found in a segment. 
RB will be used to save the RAM address of S. Since 
this is the last S encountered, its address is always 
present in RB at the end of a circulation of memory. 
The same communication bit between cells is used except 
the pulse is sent in the opposite direction. With the 
exception of these changes, the hardware procedure is 
the same for forward and backward marking. 


Thus we have the capability of marking a node or 
node member if another node or node member satisfies 
a given condition. This can be done in one disc 
revolution if the communicating elements are not in 
different subtrees. 


3.2.2 Tables, Graphs and the Relational Data Structure 


The tree/set hardware can also be used to aid in 
implementing tables like those of Figure 1 by providing 
a means of communicating between elements within each 
table. If a table is at tree level i, then each row 
(set) in the table is at level i+1. All data items in 
the same row are members of the same tree node set. 
Also, several tables may be grouped hierarchically. 


General graphs or networks cannot be linearized. 
However, they can be implemented using tables in two 
ways. One way is to set up a table for each node of 
the graph (16). Each table would contain as rows the 
node names of nodes pointed to by the table node along 
with their corresponding relation or arc names. AI1- 
ternatively, a table can be set up for each relation 
or arc name. Here each table would contain as rows 
the node name pairs that are connected by the table 
relation or arc. Relations of degree n can be stored 
by allowing more than two columns, using the set hard- 
ware. Figure 1 is an example of a set of tables, based 
on relations rather than node names, where each table 
corresponds to a relation name and each row is a set 
of nodes that are related by the table relation. In 
this information structure, cross references between 
tables are specified by content rather than by using 
physical address pointers. Query (b) involves com- 
munication between tables. The execution steps are as 
follows. First, the command SP.P# : SP.S# = 2 is 
executed using the tree/set hardware. Secondly, the 
marked SP.P#'s are used to mark rows in table P having 
the same P#. This requires the communication between 
tables implied by P.P# = SP.P#. Finally, the PNAME's 
within these rows are marked. This is accomplished by 
the tree/set command P.PNAME : P.P#. Two methods are 
described below that accomplish the communication 
between tables in the second step. 


In both of the methods described below, there 
exists the need to distinguish the marked source data 
items from the newly marked destination data items. 
Otherwise, the destination items may be used as source 
items. This can be done by having two sets of mark 
bits, one for source items and the other for des- 
tination items. Only the set of mark bits for des- 
tination items need be independent of position of the 
disc. The mark bits for source items can be stored 
on the disc along with the items. A method would be 
needed in hardware to allow these two sets of mark bits 
to conditionally set one another. Also, in both of the 
schemes described below, a method is needed to indicate 
when all data items have been traversed. This is done 
by resetting the mark bit of each source item when it 
is used, and employing an OR rail (12 and 16) between 
cells to indicate whether any mark bits are still set. 


The most obvious method for traversal between tables 
is to pick up the marked node names to be traversed 
from the source table and use these names to context 
search the destination tables. This uses only the 
tree/set hardware. However, if many items are to be 
traversed, either many comparitors are needed or many 
revolutions (searches) of memory are necessary. The 
only additional hardware needed to implement this method 
is in the form of additional comparitors to speed the 
searching if desired. 
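

The trade-off between comparitors and revolutions in this first method can be made concrete with a small model. The rows below are hypothetical and are not the contents of Figure 1; the function name and the `comparators` parameter are likewise assumptions made for illustration. Each pass of the loop stands for one revolution in which every destination item is compared against the names currently held in the comparitors.

```python
from typing import List, Sequence, Tuple


def traverse_by_content(source_values: Sequence[str], dest_column: Sequence[str],
                        comparators: int = 1) -> Tuple[List[bool], int]:
    """Mark destination rows matching any source name; count the revolutions used."""
    if comparators < 1:
        raise ValueError("at least one comparitor is required")
    pending = list(dict.fromkeys(source_values))  # distinct names, order kept
    marks = [False] * len(dest_column)
    revolutions = 0
    while pending:
        # load as many names as there are comparitors, then search once
        batch, pending = pending[:comparators], pending[comparators:]
        for row, value in enumerate(dest_column):
            if value in batch:
                marks[row] = True
        revolutions += 1
    return marks, revolutions


# Hypothetical tables: SP holds (S#, P#) rows, P holds (P#, PNAME) rows.
sp = [(1, "P1"), (2, "P1"), (2, "P3"), (3, "P2")]
p = [("P1", "nut"), ("P2", "bolt"), ("P3", "screw")]

marked_pnums = [pnum for snum, pnum in sp if snum == 2]   # SP.P# : SP.S# = 2
marks, revolutions = traverse_by_content(marked_pnums, [row[0] for row in p])
print(marks, revolutions)                                 # [True, False, True] 2
print([name for (_, name), m in zip(p, marks) if m])      # ['nut', 'screw']
```

With `comparators=2` the same traversal completes in a single revolution, which is the speed-up that the additional comparitors buy.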


The second method is to prestore in each potential 
source table the RAM addresses of the mark bits of 
node names in the destination table. In Figure 1, 
columns with attribute S# in tables S and SP are cross 
references between tables. Using this method, the RAM 
addresses of the SP.S#'s are stored next to the cor- 
responding S.S#'s. Similarly, pointers are stored in 
tables P and SP to cross reference items under at- 
tribute P#. These pointers can be prestored by 
picking up each node name in one table and using it to 
context search the other table as in the first method. 
Here, however, each time a search is completed, the 
pointers are stored. During the traversal in query (b), 
the marked pointers stored next to the SP.P#'s are 
picked sequentially and used to mark the P.P#'s. The 
additional hardware needed to implement this scheme is 
described below. 


No additional hardware would be needed if the 
stored source pointers were within the same segment of 
memory as the data items they referred to. As each 
pointer is accessed sequentially, it would simply be 
loaded into the RAM address register. However, this 
is usually not the case since sizable structures will 
be segmented over many cells. This scheme can be ex- 
tended to handle the general case, where the source 
pointers lie in different segments, by viewing all data 
items in the data base as a sequence of items. For 
example, cell 1 might contain items with global sequence 
addresses 0 to 811, items 812 to 1512 in cell 2, and 
items 1513 to 2301 in cell 3, etc. This concept is 
implemented in hardware by using a register RBSA in each 
cell to store the beginning global sequence address of 
the cell, a register RGSA to store the global sequence 
address, and an adder to compute RGSA by adding RBSA 
and the counter of Figure 7. The last sequence address 
of a segment is provided at the end of each revolution 
by RGSA and stored in register RLSA. As each pointer 
is accessed, its value is compared to both RBSA and RLSA. 
If the pointer is within these bounds, the RAM address 
register is loaded with the pointer minus (using the 
above adder) RBSA and the RAM mark bit is set. If the 
pointer is greater than RLSA, it is sent to the 
following cell for comparison. If the pointer is less 
than RBSA, it is sent to the previous cell for com- 
parison. This procedure is continued until the pointer 
has migrated to its destination cell. A register RP is 
used to hold the pointer and a one-bit register 
indicates whether RP is occupied. Two comparitors and 
two additional bits of communication between adjacent 
cells (one bit for each direction of pointer migration) 
are needed. If the register holding the pointer is 
occupied, a newly encountered marked pointer on the 
memory segment may have to be passed over until the 
next revolution. Thus this scheme may take more than 
one revolution to do all the marking required. However, 
the number will be much smaller than that of the first 
method if many items must be traversed between tables. 
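

Pointer migration can be followed step by step in a short sketch. The cell bounds are the ones used in the example above; the function name, the tuple layout and the hop counter are assumptions made for illustration, and cells are numbered from 0 here rather than from 1.

```python
from typing import List, Tuple


def route_pointer(pointer: int, start_cell: int,
                  bounds: List[Tuple[int, int]]) -> Tuple[int, int, int]:
    """Migrate a global sequence address to its cell.

    bounds[i] holds (RBSA, RLSA) for cell i. Returns the destination cell,
    the local RAM address (pointer minus RBSA) and the number of hops made.
    """
    cell = start_cell
    hops = 0
    while True:
        rbsa, rlsa = bounds[cell]
        if pointer > rlsa:
            cell += 1              # send to the following cell
        elif pointer < rbsa:
            cell -= 1              # send to the previous cell
        else:
            return cell, pointer - rbsa, hops
        if not 0 <= cell < len(bounds):
            raise ValueError("pointer lies outside the data base")
        hops += 1


bounds = [(0, 811), (812, 1512), (1513, 2301)]
print(route_pointer(1600, 0, bounds))  # (2, 87, 2)
print(route_pointer(900, 1, bounds))   # (1, 88, 0)
print(route_pointer(5, 2, bounds))     # (0, 5, 2)
```

The sketch moves one pointer at a time; in the hardware, RP can hold only one pointer per cell, which is why a busy cell may force a newly encountered pointer to wait for the next revolution.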


3.3 Execution of High Level Queries 

This section provides a more detailed illustration 
of the hardware steps required to execute high level 
queries. The storage structure to be queried is pre- 
sented first. 


The relational information structure of Figure 1 
is stored in a very straightforward way as shown in 
Figure 8, where each row is a set. Sets may contain a 
variable number of information items. The level numbers 
show the hierarchical dependency of the rows to their 
table name. The number in parentheses to the left of 
each data item is not actually stored but is inserted 
in the figure to indicate the global sequence address 
of the item. The numbers in parentheses to the right of 
each data item are a list of the global sequence 
addresses (pointers) to which the item is linked. 
Further details of the storage structure, such as 
whether items should be stored as variable length 
character strings or as fixed length code numbers, will 
not be discussed in this paper. 


FIGURE 8. STORAGE STRUCTURE OF CODD'S RELATIONAL TABLES 
(EACH ROW IS A SET) 


The comparitor submodules for S and Q in Figure 5 
are identical. A simplified illustration of the hard- 
ware of one of these submodules is given in Figure 9, 
as an example of how the hardware might be arranged. 
A more detailed discussion of the hardware arrangement 
of the comparitor submodules will not be given in this 
paper. The three comparitors are one-bit, serial 
adders, capable of arithmetic inequalities as well as 
exact matches. Each comparitor sets a flip flop when 
the specified comparison is successful. The FF's are 
reset before each item is searched. A MATCH is the 
logical AND of these three FF's. A one-bit register 
ROR is used to allow an ordered search of items. ROR 
is set when a marked item is encountered and reset if a 
tree level number is encountered which is both not marked 
and less than the level number being searched. Thus, 
ROR allows the ordered set search to remain with the 
previously specified subtree. This ordered set search 
is allowed to work over one or more segment boundaries 
as described by Healey (9) except that one item is 
searched per revolution instead of one character. 


FIGURE 9. SIMPLIFIED HARDWARE ARRANGEMENT OF 
THE S COMPARITOR SUBMODULE 
[Figure labels: INFORMATION COMPARAND; SET TYPE COMPARAND REGISTER; COMPARITOR; TO TREE/SET SUBMODULE] 


A simplified microcode is given for S and Q in 
Figure 10 to execute query b. The contents of the 
three comparand registers are given, as well as the 
position of the two switches in Figure 9, for each 
revolution of memory. Mark bits which were set during 
previous revolutions are reset if their data items do 
not satisfy the new marking conditions. Revolutions 1, 
2 and 3 execute SP.P# : SP.S# = 2 (same as query a). 
Revolution 4 transfers the mark bits of the RAM to the 
storage dependent mark bits for Q. Revolution 5 
executes the marking between tables indicated by 
P.P# = SP.P#. Revolution 6 transfers the mark bits of 
the RAM to the storage dependent mark bits for Q. 
Revolutions 7 and 8 execute PNAME : P#. 


Thus, CASSM executes a rather complex example query 
in 8 segment revolutions, or approximately 80 ms for a 
disc. Furthermore, this time is independent of the 
data base size. Non-numeric information systems im- 
plemented on von Neumann computers must page bulk in- 
formation (much of which is not relevant to the query) 
from discs to primary memory, requiring much time and 
expensive channels. 


FIGURE 10. SIMPLIFIED MICROCODE FOR QUERY b: P.PNAME : SP((P.P# = SP.P#) ∧ (SP.S# = 2)) 


3.4 Storage Allocation & Garbage Collection (SA & GC) 


From the software point of view, we would like to 
be able to initially load the data base and insert and 
delete items without regard to physical location. We 
would like to free the programmer from the burden of 
accounting for what data is in which segment of memory 
or its position within that segment. The only aspect 
of SA and GC that the programmer need be aware of should 
be a warning that the data base has exceeded the total 
size of memory. Insertions and deletions should be no 
more complicated than specifying where in the user's 
information structure to insert or delete and the in- 
formation to be inserted. In CASSM, associative 
instructions can be used to mark where to insert or 
delete based on context within the information 
structure rather than physical location. The task of 
making room for new data and repacking memory when holes 
are left by deletions can be done automatically in 
hardware. The scheme to do this is described below. 


Two registers RVL and RT are the basic hardware of 
the SA and GC submodule in Figure 5. As the read head 
picks up data, it is fed into one end of a shift 
register RVL which shifts at the same bit rate as the 
memory segment. A tap is provided for the output of 
RVL at multiples of W from the input, where W is the 
basic word size of the machine. When storage is not 
being allocated or collected (Figure 11-b), the write 
head uses the center tap of RVL as its input. When 
garbage is being collected (Figure 11-a), the input of 
the write head moves over one tap toward the input of 
RVL each time a word marked for GC is encountered on a 
segment. This eliminates that word from the sequence 
in memory. If RVL is not long enough to collect all 
words marked for GC on that segment, they can be col- 
lected in subsequent revolutions. When storage is 
being allocated (Figure 11-c), the input to the write 
head moves away from the input of RVL, again one tap 
for each word inserted. If RVL is not long enough to 
allocate all words required by an insertion, then the 
last word inserted can be marked so that the remaining 
words can be inserted beginning at that point during 
subsequent revolutions. The contents of the one bit 
RAM can be shifted forward or backward from the point of 
insertion once for each RAM delimiter within the 
insertion or deletion. This scheme provides SA and GC 
within each cell. However, the number of words used 
within a cell may grow too large for a segment to 
hold, or so small that much memory is wasted. A method 
of managing data transfers between cells is described 
below. 


FIGURE 11. VARIABLE LENGTH SHIFT REGISTER FOR INSERTION AND DELETION 
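

The effect of the moving tap can be imitated at word granularity. The sketch below is a simplification made for illustration only: it ignores the bit-serial shifting and the fixed length of a segment, and `capacity` is an assumed parameter standing for the number of taps available on one side of the center tap of RVL. The function names are hypothetical.

```python
from typing import List, Tuple


def gc_revolution(words: List[str], marked: List[bool],
                  capacity: int) -> Tuple[List[str], List[bool]]:
    """Drop up to `capacity` words marked for GC during one revolution."""
    kept_words: List[str] = []
    kept_marks: List[bool] = []
    dropped = 0
    for word, is_garbage in zip(words, marked):
        if is_garbage and dropped < capacity:
            dropped += 1          # write tap moves one step toward the RVL input
            continue
        kept_words.append(word)   # words beyond capacity wait for a later revolution
        kept_marks.append(is_garbage)
    return kept_words, kept_marks


def insert_revolution(words: List[str], after: int, new_words: List[str],
                      capacity: int) -> Tuple[List[str], int, List[str]]:
    """Insert up to `capacity` new words after index `after` in one revolution.

    Returns the new sequence, the index of the last word inserted (the point
    at which a later revolution resumes) and the words still to be inserted.
    """
    now, later = new_words[:capacity], new_words[capacity:]
    result = words[:after + 1] + now + words[after + 1:]
    return result, after + len(now), later


words, marks = gc_revolution(list("abcdef"),
                             [False, True, False, True, True, False], capacity=2)
print(words, marks)   # ['a', 'c', 'e', 'f'] [False, False, True, False]
words, marks = gc_revolution(words, marks, capacity=2)
print(words)          # ['a', 'c', 'f']

print(insert_revolution(["a", "b", "c"], 0, ["x", "y", "z"], capacity=2))
# (['a', 'x', 'y', 'b', 'c'], 2, ['z'])
```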


We choose to pack data toward one end of the chain, 
leaving unused cells at the other end. In order to 
reduce the time required for providing space for 
insertions, some of the available memory is distributed 
among the cells to act as a buffer. Within each cell, 
data is packed toward the beginning of the track. A 
special tag E is used to indicate the end of the used 
portion of the segment. A register RT is used to act 
as a buffer storage for transfers between cells. When 
the number of words used in a cell is too large, RT is 
filled with the last words before the tag E. These are 
then stored at the beginning of the next revolution. 
When the number of words used in a cell is too small, 
RT is filled with the beginning words of the following 
track. They are then stored directly in front of the 
tag E. The register RVL is used in both cases to move 
the bulk of the used data forward or backward within a 
cell. Also, the contents of the one bit RAM can be 
shifted with the data. The counter of Figure 7, which 
counts the number of RAM delimiters in a cell, will 
point to the last bit used in the RAM at the end of 
each revolution. The number of delimiters being 
transferred in RT indicates how many RAM bits to 
transfer, starting where the counter points. RBSA is 
updated by incrementing RBSA once for each delimiter 
passed into a cell, to or from other cells. It is 
possible for the size of RT to be too small to handle 
the number of insertions required in one revolution. In 
this case, the remaining insertions must be made during 
subsequent revolutions as the SA and GC hardware allows. 
Large insertions and deletions can be made in the middle 
of the data base by writing entire tracks forward or 
backward, one track per revolution. 


If the data stored on the segments is free of 
parameters regarding physical location, the above scheme 
can be carried on during the same revolutions (in a 
pipeline fashion) as instruction execution. Also, no 
software intervention is needed for SA and GC. The 
tree/set scheme presented in section 3.2.1 has this 
property as well as the first method presented in 
section 3.2.2 for implementing communication between 
tables. The second method in section 3.2.2 requires 
that pointers, which are dependent only on the global 
sequence address of the data they point to, be stored 
in the segment memory. They require maintenance after 
insertions or deletions are made. However, movement of 
data between cells can be carried on in a pipeline with 
instruction execution since pointers are independent of 
cell number. 


If an insertion is made between items with global 
sequence addresses a and a+1, then all pointers referring 
to items at or above a+1 must be changed. The pointers P are 
altered in parallel as indicated by the following 
conditional assignment: 


P ← P + N_D IF a < P, ∀P, 


where N_D is the number of delimiters or data items 
within the insertion. For deletions, N_D is subtracted. 
If insertions are made at two points in memory, one 
after address a and one after address b, the following 
two operations are necessary: 


P ← P + N_Da IF a < P, ∀P; 
P ← P + N_Db IF b < P, ∀P, 


where N_Da and N_Db are the numbers of delimiters within 
insertions a and b. For n points of insertion, n such 
steps are necessary. Instructions implementing the 
above algorithm can be provided to allow this updating 
after insertions and deletions to be done in a simple 
way, and to allow full parallelism to be exploited. 
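

The conditional assignment can be checked on a small list of pointers. This is a sketch rather than the instruction set of the machine: the function names are hypothetical, a negative count is used here to model a deletion, and it is assumed that no pointer refers to a deleted item. When several insertions are given in terms of the original addresses, the sketch applies them in descending address order so that an earlier step does not disturb the address used by a later one; that ordering is a choice made for this example.

```python
from typing import List, Tuple


def shift_pointers(pointers: List[int], a: int, n: int) -> List[int]:
    """Apply P <- P + n to every pointer P with a < P (n < 0 models a deletion)."""
    return [p + n if p > a else p for p in pointers]


def apply_insertions(pointers: List[int],
                     insertions: List[Tuple[int, int]]) -> List[int]:
    """Apply one update step per (address, count) insertion, highest address first."""
    for a, n in sorted(insertions, reverse=True):
        pointers = shift_pointers(pointers, a, n)
    return pointers


print(shift_pointers([3, 10, 11, 40], a=10, n=5))            # [3, 10, 16, 45]
print(shift_pointers([3, 10, 16, 45], a=10, n=-5))           # [3, 10, 11, 40]
print(apply_insertions([3, 10, 11, 40], [(10, 5), (30, 2)]))  # [3, 10, 16, 47]
```

In the hardware every pointer is tested and updated during the same revolution, so each step costs one pass over memory regardless of how many pointers are affected.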


4. SUMMARY AND CONCLUSIONS 


This paper presents the architecture of a context- 
addressed, segment-sequential memory designed for non- 
numeric information processing. Since it is unrealistic 
to describe the design or evaluation of a hardware 
system out of context of software and application, the 
information structures and retrieval operations 
currently used in the existing information systems are 
first described. The architecture of the system is 
then described to show how various information struc- 
tures can be represented and search operations can be 
carried out directly on bulk memory with little inter- 
vention from the central processor. Hardware storage 
allocation and garbage collection techniques used in 
the system are also detailed. 


The CASSM processor provides data processing 
capabilities useful for information retrieval in large 
data bases. It offers a much more cost-effective non- 
numeric processing system than conventional information 
systems which use von Neumann processors to perform 
search, store, arrangement, allocation, garbage col- 
lection and other data processing functions. The 
characteristics and advantages of CASSM can be sum- 
marized as follows: 


(1) In non-numeric processing, it is extremely 
time-consuming to page data in and out of the secondary 
memory and to perform searches by the central processor, 
especially when the data base is large. CASSM allows 
data to be searched in parallel on a set of circulating 
devices, so that search time is independent of the size 
of the data base. This operation can be carried out 
independently of the central processing unit. 


(2) In non-numeric processing, the user shall be 
allowed to work with the data as he sees it (i.e., at 
the information structure level) without having to con- 
cern himself with the internal representation of the 
data. CASSM allows information structures to be stored 
as they are without going through many levels of data 
mapping which are found necessary in conventional 
computers to achieve search efficiency. High level 
search queries specified by the information user can be 
performed by the memory device as basic operations, thus 
simplifying the retrieval language design. This feature 
of CASSM avoids many problems found in information 
systems concerning data reliability, excessive storage 
requirement and structure construction and maintenance. 


(3) The data base of an information system is 
generally dynamic in the sense that the contexts are 
constantly changing. A considerable number of insertions, 
deletions and modifications need to be performed. 
Memory management is a serious problem and is generally 
handled by software using CPU time. In CASSM, manage- 
ment of memory is greatly simplified by having only one 
level of memory hierarchy and by the SA and GC hardware. 


(4) From the hardware point of view, CASSM offers 
several advantages. The class of sequential memories 
can be very inexpensive. Also as a data base grows, 
more memory units can be added modularly. The logic 
requires only one LSI chip type and interconnections 
are very simple and regular because all the cells are 
identical. 


DERIVING DESIGN GUIDELINES FOR 
DIAGNOSABLE COMPUTER SYSTEMS 


ABSTRACT 


Diagnosable computer systems are designed to 
detect and isolate the faults that occur during system 
operation. A number of techniques are available to 
the system designer of diagnosable systems. This 
paper examines a number of these techniques and 
derives a set of design guidelines incorporating them. 


I. INTRODUCTION 


There are many reasons for needing design guide- 
lines before one attempts to design a computer system. 
Perhaps the most important reason is the nature of the 
design process itself. Usually from a rough initial 
specification the design proceeds through a series of 
iterations between software and hardware designers. 
Eventually a prototype machine is actually constructed. 
Only then are many serious problems associated with 
hardware and software discovered. Usually the bulk of 
the effort directed toward developing diagnostic soft- 
ware and procedures does not occur until the machine 
is physically realized. Once the machine exists, it 
is difficult, if not impossible, to significantly 
alter it from its original design. As a consequence 
of the design process, the designer must include 
diagnosability from the very inception of his work, if 
diagnosability is a desired quality in the finished 
system. 


A diagnosable computer system is defined as one 
designed to detect the occurrence of faults before 
errors are introduced into the system operation and to 
aid in their isolation. 


It is difficult to define what the scope of de- 
sign guidelines should encompass. If design guide- 
lines are too narrow and too specific, then they will 
only be applicable to a narrow class of systems. On 
the other hand, if the design guidelines are overly 
general, chances are they will be of little value. 
Perhaps the best goal is to attempt to develop design 
guidelines which can be related to physically realiz- 
able computer systems. Design guidelines hopefully 
should show where the system is sensitive to changes 
in design in order to facilitate obtaining the desired 
qualities in the completed system. 


Naval Weapons Laboratory, Dahlgren, Virginia. 


II. THE MSG/FTU MODEL 


In this paper, several design guidelines will be 
developed through the use of a computer system model. 
The design guidelines will be derived through the 
development of a number of conditions and relations 
pertaining to the operation of the computer system 
model. The model is general in nature and easily 
relatable to physically realizable computer systems. 
The operation of nearly any computer system can be 
simulated by an appropriate version of the model. In 
many cases, the structure of the model will be very 
similar to the actual structure of the computer system 
that it simulates. The model is developed from the 
common computer system representation illustrated in 
Figure 1. 


FIGURE 1 
A Computer System Model 


The model is known as the macro state generator/ 
functional transform unit (MSG/FTU) model and is 
illustrated in Figure 2. The macro state generator 
(MSG) is somewhat analogous to the control unit for 
the representation in Figure 1. The functional 
transform unit (FTU) is analogous to all elements in 
the computer system other than the control unit. Both 
the MSG and FTU are finite devices. Before using the 
MSG/FTU model as a vehicle for analysis, it is necess- 
ary to carefully define its mode of operation. 


FIGURE 2 
MSG/FTU Computer System Model 


The MSG is connected to the FTU by means of two 
paths. The path labeled C in Figure 2 is known as the 
control path and is used by the MSG for transmitting 
data and command information to the FTU. The path 
labeled R in the same figure is known as the response 
path and is used by the MSG to obtain information 
pertaining to the operation and status of the FTU. It 
is important to remember that the action of a computer 
system is to simulate the operation of some virtual 
machine. Each virtual machine instruction is composed 
of more than one microsequence. A microsequence is 
the smallest basic operation that can be performed by 
the FTU. The MSG controls the execution of micro- 
sequences. That is, it regulates order and timing of 
the microsequences needed to compose the desired vir- 
tual machine instructions. The operation of the MSG 
proceeds on a discrete time base. That is, MSG 
operations can only be initiated at periodic points in 
time. These points are determined by the quantum time 
of the MSG. The quantum time or qtime is defined as 
the length of the minimum interval that can occur be- 
tween the initiation of operations in the MSG at 
different points in time. All operations in the MSG 
occur at points in time which are multiples of the 
qtime periods. The qtime can be thought of as the 
basic clock cycle time in the MSG. The terminology 
"macro state generator" is derived from the fact that 
the MSG initiates operations in the FTU which can 
cause the FTU to transition through several true se- 
quential machine states before the operation is com- 
pleted. Hence, the modifier "macro" is added to indi- 
cate the scope of the state transitions. 
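

The discrete time base can be illustrated with a one-line calculation. The function below is a sketch introduced for illustration: it assumes that time is measured in integer units (for example nanoseconds) from a common origin, and it returns the earliest multiple of the qtime at which an operation requested at a given instant could be initiated.

```python
def next_initiation(request_time: int, qtime: int) -> int:
    """Return the earliest multiple of qtime that is not before request_time."""
    if qtime <= 0:
        raise ValueError("qtime must be positive")
    return -(-request_time // qtime) * qtime   # integer ceiling, then scale


print(next_initiation(250, 100))  # 300
print(next_initiation(300, 100))  # 300
print(next_initiation(1, 100))    # 100
```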


The FTU contains the elements needed for the 
operation of the virtual machine. That is, it contains 
all functional units, memory elements, and data trans- 
fer paths used in the operation of the virtual machine. 


Operations in the FTU are initiated by the control 
path from the MSG. Information concerning an FTU 
operation is transmitted to the MSG by the response 
path. The sequences of control information and res- 
ponse information transmitted between the MSG and FTU 
are known as the CR sequences. The complete set of CR 
sequences that can be executed by the MSG is known as 
the repertoire of the MSG. The only way that the vir- 
tual machine can obtain information about the status of 
the hardware is by the execution of CR sequences which 
can manipulate the memory and state of the FTU. Hence, 
the virtual machine program can then attempt to ascer- 
tain the state of the hardware by inspecting the memory 
and the state of the virtual machine after execution of a 
virtual machine instruction. 


III. DEVELOPMENT OF THE MSG/FTU MODEL 

The next step in the developing of design guide- 
lines is to examine a series of conditions and rela- 
tions pertaining to the MSG/FTU model. The first set 
of conditions to be considered deal with the nature of 
the information available about the operation of the 
FTU. 


Condition 1 
The MSG can only observe the operation of 
the FTU at a finite number of points. 


Condition 1 is derived from the design of the 
MSG/FTU model. The MSG is finite. Additionally, the 
FTU is finite and only presents a finite number of ob- 
servation points. Also, there are a number of unob- 
servable points in the FTU such as values internal to 
switching elements. 


Condition 2 
The MSG can only observe the operation of 
the FTU at discrete points in time. 


Condition 2 is based upon the definition of the 
operation of the MSG. Any operation in the MSG can 
only occur at a point in time which is a multiple of 
the qtime of the MSG. 


Condition 3 
The MSG is the only element in the model 
that can observe the operation of the FTU directly. 


Condition 3 is a result of the definition of how 
the model operates. 


Condition 4 
The set of FTU observations made by the MSG 
constitutes the only source of information concerning 
FTU operation available to virtual machine level pro-
grams.


The only manner in which a virtual machine level 
program can obtain information is through the manipula-
tion of virtual machine memory and status by the execu-
tion of MSG microsequences.


By Condition 3, the MSG is the only element which 
can observe the FTU operations. Hence, the observa- 
tions of the MSG are not only the sole source of FTU 
information, but the MSG initiated microsequences are 
the only means by which this information can be trans- 
mitted to the virtual machine level. 


The key observation here is that information con- 
cerning the operation of the FTU is only available to 
a virtual machine level program through the auspices
of the MSG. Moreover, the information available to a
virtual machine level program is usually only a part of 
the information available to the MSG. 
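
Conditions 1 through 4 can be illustrated with a small sketch. Everything named here (the register names, the internal value gate_17, the add_one operation, and the increment instruction) is hypothetical and is not taken from any real machine; the sketch only shows that the MSG samples a finite set of observation points and that a virtual machine instruction may pass on only part of what the MSG observed.

```python
from dataclasses import dataclass, field


@dataclass
class FTU:
    """FTU with a finite set of observable points and some hidden values."""
    registers: dict = field(default_factory=lambda: {"acc": 0, "carry": 0})
    internal: dict = field(default_factory=lambda: {"gate_17": 1})  # unobservable

    def observation_points(self):
        # Condition 1: only a finite set of points is observable.
        return dict(self.registers)


@dataclass
class MSG:
    """The only element that observes the FTU directly (Condition 3)."""
    ftu: FTU
    observed: dict = field(default_factory=dict)

    def run_cr_sequence(self, control):
        # Control path: apply the operation. Response path: sample the FTU
        # once the operation is complete (a discrete point in time, Condition 2).
        control(self.ftu)
        self.observed = self.ftu.observation_points()
        return self.observed


def add_one(ftu):
    """Hypothetical FTU operation: increment an 8-bit accumulator."""
    total = ftu.registers["acc"] + 1
    ftu.registers["carry"] = total >> 8
    ftu.registers["acc"] = total & 0xFF


def vm_instruction_increment(msg, vm_memory):
    """Hypothetical virtual machine instruction (Condition 4)."""
    seen = msg.run_cr_sequence(add_one)
    vm_memory["acc"] = seen["acc"]  # the carry is seen by the MSG but not passed on
    return vm_memory


msg = MSG(FTU())
vm_memory = vm_instruction_increment(msg, {})
print(sorted(msg.observed))  # points the MSG observed
print(sorted(vm_memory))     # what the virtual machine program can inspect
```

In this sketch the virtual machine program sees a strict subset of the MSG's observations, and neither level sees the internal value.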


Much of the difficulty often encountered in at- 
tempting to perform a thorough diagnosis of a computer 
system is due to the design and complexity of the 
system. In diagnosing a computer system of some com- 
plexity, one is not just faced with attempting to diag- 
nose a single combinational or single sequential logic 
system but rather a complex interconnection of combi- 
national and sequential logic systems. Probably the 
most serious problem in performing diagnosis in a 
computer system is that of accessibility to the indi- 
vidual logic system being tested. Usually, it is not 
possible to access the individual combinational or 
sequential logic system by itself from the virtual 
machine level. Often, it is simply not possible to 
create effective and practical diagnostics from the 
virtual machine level due to the design of the system. 


Condition 3 provides assistance in attempting to 
deal with the problem of system diagnosis. Condition 
3 states that the MSG is the only element capable of 
observing the action of the FTU directly. By using the 
MSG, it is possible to utilize microsequences to per-
form diagnosis on the FTU. Diagnosis by use of micro-
sequences is commonly known as microdiagnosis.
Microdiagnostics have been employed for some time on a 
number of commercially available computer systems (1,2, 
4,5). Common arguments for the use of microdiagnostics 
are usually based on the fact that it gives greater 
access to individual logic systems in the computer. 
Also, use of microdiagnostics releases the diagnosti- 
cian from having to build his diagnostic tests around 
the standard virtual machine instruction cycle of fetch 
instruction, fetch operand, and execute instruction. 
However, the use of microdiagnostics implies that the 
proper CR sequences either always exist in the reper- 
toire of the MSG or that the repertoire of the MSG is 
not fixed and can be altered to meet the needs of the 
microdiagnostician. If the MSG of the system is
microprogrammable, then the repertoire of CR sequences
is considered to be variable. An MSG with a fixed 
repertoire is known as a hardwired MSG. 


Before proceeding to develop relations which are 
concerned with diagnosis of the system, the procedure 
of diagnosing the MSG/FTU system model will be examined. 


In the case of either the use of a microprogrammed 
MSG or the use of a hardwired MSG, the MSG is the first 
logic system to be diagnosed because without assurance 
of the integrity of the MSG, no diagnosis of the FTU is 
possible. In the case of a hardwired MSG, diagnosis 
can present extremely serious problems. Normally, 
hardwired units do not possess a regular logical struc-
ture. As a result, external diagnosis by human diag-
nostic test input can be an extended project. Diagno-
sis of a microprogrammed MSG can be much more straight-
forward. Normally, a microprogrammed MSG unit possess-
es a much simpler and more regular logic structure than
a hardwired MSG unit of similar ability. Additionally,
since the repertoire of the MSG is variable, certain CR
sequences can be included to aid the human diagnostic
tester in performing diagnosis on the MSG. Once the
integrity of the MSG has been verified, diagnosis of 
the FTU is the last step to be performed in diagnosis 
of the MSG/FTU system. 


Diagnosis of the FTU can be performed by one of 
two different methods. First, diagnosis of the FTU can 
be performed through the use of virtual machine in- 
structions which initiate a number of CR sequences 
determined by the type of virtual machine instruction. 
Second, individual CR sequences can be used ex-
pressly to diagnose portions of the FTU. In the case of
a hardwired MSG, only the first method is available, 
while in the case of a microprogrammed MSG both methods 
can be utilized. Upon completion of FTU diagnosis, the 
system will be completely diagnosed. 


In a hardwired MSG, one of the most difficult 
areas to diagnose is the circuitry used to generate the 
microsequences. Commonly, this is a microsequence 
selection matrix and associated timing sequence cir- 
cuitry. For a background on the operation of such 
units, see Rosin (6). In actual practice, the control 
unit or MSG of systems is diagnosed by executing a 
virtual machine program. If the virtual machine pro-
gram executes properly, the maintenance engineer assumes
that the MSG or control unit is operating properly.
Needless to say, this often is a self-defeating prac- 
tice if the FTU contains faults. Unless some type of 
hardware tester is available to check the control se- 
quences produced by the hardwired MSG, the diagnosis 
can be a tedious and difficult job. The inherent 
difficulty of diagnosing the hardwired MSG is one of 
several serious drawbacks to using the hardwired 
MSG in a diagnosable system. 


Diagnosis of a microprogrammed MSG can be made a 
somewhat more feasible job than diagnosis of a similar 
hardwired MSG. This is due to the manner in which the 
microprogrammed MSG produces control sequences. The 
major component to be diagnosed is the control memory 
that stores the control sequences. Once the logic 
circuitry used to sequence the fetching and executing 
of control sequences from the control memory has been 
externally diagnosed, a program of diagnosing the 
control memory can be initiated. The type of diagnosis 
used for the control memory depends, of course, on the 
nature of the memory. If the memory is a read only 
memory such as transformer or capacitor read only 
storage, then it is likely that a section of control 
memory containing diagnostic test patterns will be 
inserted into the control memory unit. Once this is 
done, a diagnostic sequence can be run using the logic 
in the MSG to read the contents of the memory to check
for proper memory operation. Needless to say, this
entails additional logic circuitry to provide for ex- 
ecuting such diagnostic sequences and a method of 
verifying the diagnostic control memory. If the con- 
trol memory is both a read and write memory, then a 
diagnostic sequence can be constructed to both write 
and read data to the control memory. Here again, 
additional logic circuitry may be required to facili- 
tate implementation of the diagnostic sequences. 
Almost any advantage in diagnosis that the micropro- 
grammed MSG has over the hardwired MSG is due to the 
more regular structure of the microprogrammed MSG. 
Even though the structure of the microprogrammable MSG 
is regular, it is very flexible in that it is pro- 
grammable and as such retains the inherent power of a 
stored program computer. 
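
For a control memory that is both readable and writable, the write-then-read diagnostic sequence described above can be sketched as follows. This is a minimal illustration and not the procedure of any particular machine: the FaultyMemory class, the stuck-at-0 fault model, and the choice of two complementary alternating-bit patterns are all assumptions made for the example. The check overwrites the memory, so the real contents would have to be reloaded afterwards.

```python
def check_writable_control_memory(read_word, write_word, size, width):
    """Write and read back two complementary patterns; return failing addresses."""
    mask = (1 << width) - 1
    even_bits = sum(1 << i for i in range(0, width, 2))  # 0101...01
    patterns = [even_bits, even_bits ^ mask]             # and its complement
    failures = set()
    for pattern in patterns:
        for address in range(size):
            write_word(address, pattern)
        for address in range(size):
            if read_word(address) != pattern:
                failures.add(address)
    return sorted(failures)


class FaultyMemory:
    """Hypothetical control memory with an optional bit stuck at 0."""

    def __init__(self, size, stuck_address=None, stuck_bit=0):
        self.words = [0] * size
        self.stuck_address = stuck_address
        self.stuck_bit = stuck_bit

    def write(self, address, value):
        self.words[address] = value

    def read(self, address):
        value = self.words[address]
        if address == self.stuck_address:
            value &= ~(1 << self.stuck_bit)  # the faulty bit always reads 0
        return value


good = FaultyMemory(16)
bad = FaultyMemory(16, stuck_address=5, stuck_bit=3)
good_result = check_writable_control_memory(good.read, good.write, 16, 8)
bad_result = check_writable_control_memory(bad.read, bad.write, 16, 8)
print(good_result, bad_result)
```

Using a pattern and its complement ensures that every bit position is tested with both a 0 and a 1, which is why the single stuck bit is exposed.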


The next step is to develop four relations which 
illustrate the differences between a hardwired and a 
microprogrammed MSG. The goal of these relations is 
to lend substance to the claim that a microprogrammed 
MSG is needed in diagnosable computer system environ- 
ments. 


Relation 1 
Given: (1) an MSG with a fixed repertoire of 
CR sequences; (2) a specified virtual machine instruc- 
tion set. 
Then, the effectiveness of a diagnosis of the 
FTU by a virtual machine diagnosis program is sensitive 
only to the design of the FTU. 


The basis of Relation 1 is that the given condi- 
tions constrain the design. Since the virtual machine 
instruction set is specified and can command only a 
fixed set of CR sequences, the only way to make diagno-
sis more or less effective is in the design and organ-
ization of the FTU.


The major observation that should be made at this 
point is that if it is desired to have effective FTU 
diagnosis, then the initial system design is the cri- 
tical element in the process of system production. 
Once a hardwired system is physically realized, the 
only way to improve diagnosis is to write better vir- 
tual machine diagnosis programs. It is often the case 
that due to system design little improvement can be 
achieved on the virtual machine level. In fact, due to 
the complexity of most systems, many problems are un- 
known during the design phase and are only discovered 
upon fabrication of the system. This can lead to ex- 
tremely serious diagnosis problems commonly reflected 
in the statement that the "diagnostic programs are 
programs that execute correctly when no other programs 
can execute due to hardware faults".


Relation 2 
Given: (1) an MSG with a variable repertoire
of CR sequences; (2) a specified virtual machine in-
struction set.
Then, the diagnosis of the FTU is sensitive
to the design of the FTU and to the CR sequences
available in the MSG repertoire.


The basis of Relation 2 is that even though the 
virtual machine instruction set is specified, addition- 
al CR sequences can be added to aid in the process of 
FTU diagnosis. In this way, the design of the FTU is 
not the sole determining element as it is in the case 
of Relation 1.


The key point to observe here is that by having 
a variable repertoire of CR sequences in an MSG it may 
be possible to enhance diagnosis after the system is 
fabricated. This is not to say that the design of the 
FTU is not as critical as it is in the case of the 
hardwired MSG. The design of the FTU is still a major
factor in the effectiveness of its own diagnosis.


Relation 3 

Given: (1) MSG/FTU system A having an MSG 
with a variable repertoire of CR sequences; (2) MSG/ 
FTU system B having an MSG with a fixed repertoire of 
CR sequences; (3) virtual machine instruction set C;
(4) both system A and system B can execute only in-
struction set C; (5) both system A and system B pos-
sess identical FTU units.

Then, (1) the ability of FTU diagnosis in 
system A to detect FTU faults during the diagnosis 
procedure is greater than or equal to the ability of 
system B to perform the identical operation; (2) the 
ability of FTU diagnosis in system A to isolate FTU 
faults is greater than or equal to the ability of 
system B to perform the identical operation. 


Both parts of Relation 3 are based upon the same 
concept. The sets of CR sequences that can be executed
in system B are fixed by the virtual machine instruction
set and the design of the FTU. Hence, system B can
only perform fixed sets of CR sequences. In the course
of performing the diagnosis procedure on the system B 
FTU it can possibly be that either fault detection or 
fault isolation or both can be improved by the addi- 
tion of a CR sequence not available in system B. If 
no CR sequences can be added to improve system B 
diagnosis, then system A will have a level of diagnos- 
ability equal to system B. 


Relation 3 shows that enhanced FTU diagnosis may 
possibly be achieved simply by use of an MSG unit 
having a variable CR sequence repertoire instead of an 
MSG unit having a fixed CR sequence repertoire. 
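
Relation 3 can be illustrated with a toy fault model. All of the data below is hypothetical: the fault names, the CR sequence names, the two instructions, and the assumption that each CR sequence exposes a known set of FTU faults. System B can only run whole instructions from instruction set C, while system A can run the same instructions plus individual CR sequences, including one added diagnostic sequence. Isolation is measured here, as an assumption, by the number of distinct pass/fail signatures among detected faults.

```python
# Hypothetical data: the FTU faults that each CR sequence exposes.
DETECTS = {
    "cr_load": {"f1"},
    "cr_add": {"f1", "f2"},
    "cr_store": {"f3"},
    "cr_probe_carry": {"f4"},  # extra diagnostic sequence, addable only in system A
}

# Instruction set C: each instruction triggers a fixed bundle of CR sequences.
INSTRUCTIONS = {
    "ADD": ["cr_load", "cr_add", "cr_store"],
    "STORE": ["cr_store"],
}

FAULTS = ["f1", "f2", "f3", "f4"]


def failing_tests(fault, tests):
    """Names of the tests that fail when the given fault is present."""
    return frozenset(
        name for name, sequences in tests.items()
        if any(fault in DETECTS[s] for s in sequences)
    )


def diagnose(tests, faults):
    """Return (set of detected faults, number of distinct failure signatures)."""
    signatures = {f: failing_tests(f, tests) for f in faults}
    detected = {f for f, signature in signatures.items() if signature}
    classes = {signature for signature in signatures.values() if signature}
    return detected, len(classes)


tests_b = dict(INSTRUCTIONS)                               # fixed repertoire
tests_a = {**INSTRUCTIONS, **{s: [s] for s in DETECTS}}    # variable repertoire

detected_b, classes_b = diagnose(tests_b, FAULTS)
detected_a, classes_a = diagnose(tests_a, FAULTS)
print(sorted(detected_b), classes_b)
print(sorted(detected_a), classes_a)
```

Because the tests of system A include every test of system B, system A can never detect fewer faults or distinguish fewer of them; in this example it does strictly better on both counts.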


Relation 4 
Given the conditions in Relation 3.
Then, the number of CR sequences used to per-
form diagnosis in system B is greater than or equal to
the number of CR sequences used to perform FTU diagno-
sis in system A.


The proof of Relation 4 is based on the fact that,
since the CR sequences are fixed in system B, it is
possible that unneeded CR sequences will be executed
in the course of FTU diagnosis, because every virtual
machine instruction causes execution of a predetermined
set of CR sequences. If there are unneeded CR se-
quences, these CR sequences can be omitted in the
diagnosis of system A since the MSG repertoire is
variable. In that case, the length of the diagnostic se-
quence in system B is greater than that of system A.
If no CR sequences can be omitted, then the lengths
of the diagnostic sequences in both systems are equal.
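
The counting argument behind Relation 4 can be sketched with hypothetical data. The instruction names and CR sequence names below are invented for illustration, and the choice of instructions for system B is a simple per-sequence selection, not a minimal cover. It is also assumed that every needed CR sequence appears in at least one instruction.

```python
# Hypothetical instruction set: each instruction runs a fixed bundle of CR sequences.
INSTRUCTIONS = {
    "ADD": ["cr_load", "cr_add", "cr_store"],
    "STORE": ["cr_store"],
}


def sequences_run_fixed(needed, instructions):
    """System B: each needed CR sequence is reached through a whole instruction."""
    chosen = set()
    for sequence in needed:
        candidates = [name for name, body in instructions.items() if sequence in body]
        # Pick the shortest instruction that contains the needed sequence.
        chosen.add(min(candidates, key=lambda name: len(instructions[name])))
    return sum(len(instructions[name]) for name in chosen)


def sequences_run_variable(needed):
    """System A: only the needed CR sequences are executed."""
    return len(set(needed))


for needed in (["cr_add"], ["cr_store"]):
    print(needed, sequences_run_fixed(needed, INSTRUCTIONS), sequences_run_variable(needed))
```

Testing with cr_add forces system B to run the whole ADD instruction, while testing with cr_store costs the same in both systems, which corresponds to the "greater than or equal to" in the relation.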


IV. DESIGN GUIDELINES 

To achieve the design goal of diagnosability it is 
necessary that the system designer always keep con- 
current fault detection and fault isolation central in 
the system design. The next section deals with guide- 
lines pertaining to the MSG. Throughout this section 
it should be remembered that the guidelines are for
the purpose of achieving the design goal of diagnos-
ability.


There are a number of important items to be con-
sidered in the designing of an MSG. The design guide-
lines are as follows:

1. The MSG should have a variable repertoire of 
CR sequences, i.e. it should be microprogrammed. 
Relations 1, 2, 3, and 4 illustrate the advantages that
are available to the microprogrammed MSG in diagnosis
of the FTU. These advantages cannot be ignored since
diagnosis of the FTU is a key part of designing a
diagnosable computer system. This is not to imply that
a hardwired MSG system is completely undiagnosable.

The difficulty encountered with the hardwired MSG/ 
virtual machine instruction diagnosis is that it im- 
plies attempting to diagnose the real machine by means 
of the virtual machine which itself is being simulated 
by the real machine. The process is somewhat self- 
defeating. 

2. The MSG has to satisfy the diagnosable system 
design goals. That is, ideally all faults must be 
detected, and ideally all faults must be capable of 
being isolated. This goal is more easily approachable 
in a microprogrammed MSG due to its structure, which 
is more regular and often less complex than a compar- 
able hardwired MSG. External fault diagnosis and iso- 
lation procedures for a microprogrammed MSG are more 
straightforward to develop than for a comparable 
hardwired MSG. 


Many techniques exist for concurrent fault 
detection in a microprogrammed MSG. For example, the 
control memory could utilize fault detection techniques 
as proposed by Szygenda in his fault tolerant memory 
design (7). Coding techniques can be utilized to pro-
vide fault detection on gating function lines.

Another important area of fault detection is in the 
area of timing sources used to sequence the operation 
of the MSG and FTU operations. A number of techniques 
for detecting faults in timing sources are available (3).


It is necessary to next consider the problem of 
FTU access for diagnosis. This is needed since the 
MSG must access the logic subsystems of the FTU in 
order to facilitate diagnostic testing of the FTU. 
Since, in any nontrivial system, diagnostic testing
will be conducted on a non-monolithic or modular
basis, it is important to consider the diagnosis of
partitions and combinations of partitions.


Figure 3 illustrates a system composed of two
partitions, SS1 and SS2. If both SS1 and SS2 are
strictly combinational logic, it is quite possible that
access to only the input of partition SS1 and the out-
put of partition SS2 will be sufficient for diagnostic
testing and fault isolation. To determine if this is
true, diagnostic test generation can be employed to
discover how faults can be detected and how well they
can be isolated. If either SS1 or SS2 or both are
sequential logic systems, the picture changes and be-
comes less predictable. Problems arise in attempting
to diagnose the composite system by means of using only 
the system input and the system output. 


FIGURE 3 
A Two Partition System 


These problems arise from the nature of sequential 
logic systems. Sequential logic systems can be diffi- 
cult to diagnose for three basic reasons. First, the 
sequential logic must be initialized to some known 
state for diagnosis. Whether or not this is a diffi- 
cult task depends upon the design of the logic in 
question. If no means exists to preset the logic to a 
known state by using an additional input or preset line, 
then it is necessary to use a homing sequence to drive
the sequential logic to a known state before diagnosis
can be started. Second, if the sequential logic is 
incompletely specified, there will be input sequences 
which can cause unpredictable outputs. Third, se-
quential logic can possess a large number of states
which can make diagnostic testing a very extended task.
For example, a 24 binary digit counter has over 16 
million states. If the sequential logic is complex 
with a large number of possible states, diagnosis can 
be a difficult effort to accomplish. Combining 
sequential logic subsystems such as in Figure 3 can 
often result in a system that is more difficult to 
diagnose than its constituent subsystems. In most
computer systems, usually more than just two subsys-
tems are combined to obtain the system. Combining
sequential logic subsystems results in a larger 
sequential logic system which presents correspondingly 
more difficult problems in diagnosis. Providing 
accessibility to the logic subsystems for the purpose 
of diagnostic testing is important. On computer 
systems not designed with sufficient access for 
diagnosis to logic subsystems, maintenance engineers 
often externally access logic subsystems with os- 
cilloscopes and other instrumentation to facilitate 
diagnostic testing. Access to logic subsystems for 
diagnosis must be included in the initial system 
design.
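
The growth in the number of states can be made concrete with a short calculation. The sketch assumes that each binary storage element contributes a factor of two and that the composite of several subsystems can, in the worst case, reach every combination of their states; the pairing of two 24-bit counters is a hypothetical example.

```python
def state_count(storage_bits):
    """Number of states of a sequential circuit with the given number of binary storage elements."""
    return 2 ** storage_bits


def composite_state_bound(*subsystem_bits):
    """Upper bound on the states of a composite system: subsystem state counts multiply."""
    bound = 1
    for bits in subsystem_bits:
        bound *= state_count(bits)
    return bound


counter_states = state_count(24)
two_counters = composite_state_bound(24, 24)
print(counter_states)  # 16777216
print(two_counters)    # 281474976710656
```

A single 24-bit counter therefore has 16,777,216 states, while two such counters observed only through the composite input and output present up to 2^48 joint states, which is why access to the individual subsystems matters.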


V. CONCLUSIONS


The relations that have been presented in this 
paper illustrate that a microprogrammed control unit 
can have several important advantages in a diagnosable 
computer system. Namely, a microprogrammed control 
unit may enjoy enhanced fault isolation and a shorter 
diagnostic test sequence over a comparable hardwired
control unit. Integrally enmeshed in the design of a 
diagnosable computer system is the need for access to 
the logic subsystems of the FTU for the purposes of 
diagnostic testing by the MSG. Design of a diagnosable 
computer system is a process of balancing access to 
logic subsystems with the corresponding ability of the 
MSG to properly use available FTU access points for 
effective diagnostic testing. To achieve a diagnosable 
system, the designer must keep this balancing process 
in mind in all of his design activities.


ABSTRACT


Recent advances in computer technology have made 
the design of large and very flexible associative proc- 
essors possible. Such systems are extremely complex 
and must be adequately protected against failures if 
they are to be used in critical application areas such 
as air traffic control or for performing control func- 
tions in fault-tolerant computers. This paper summa- 
rizes the results of a study which has indicated the 
techniques that are applicable in the design of fault- 
tolerant associative processors. Associative process- 
ors are divided into four classes of fully parallel, 
bit-serial, word-serial, and block-oriented systems. A 
technique for modularizing the design of an associative 
processor is given. The detection of errors within 
modules is discussed for the four classes mentioned 
above. Several schemes for reconfiguration are dis- 
cussed which allow us to establish an appropriate inter- 
communication pattern after replacing the faulty module 
by a spare. The design of a fault-tolerant associative 
processor, which uses some of the techniques discussed 
previously, is presented. 


BACKGROUND 


Associative processors are of interest since they 
enable us to solve many data processing problems for 
which digital computers with conventional architectures 
are either unsuitable or highly inefficient. Based on 
the applications that have been proposed for associa- 
tive processors, there are at least two reasons for 
studying the fault tolerance problems of such devices: 
(1) In some proposed application areas, such as air 
traffic control [1], the effect of an undetected fault- 
induced error may be catastrophic. (2) To be able to 
perform control functions [2] in a fault-tolerant com- 
puter, an associative device must itself be fault- 
tolerant, since, otherwise, it will become part of the 
system's hard core and will contribute heavily to its 
unreliability. In addition, the extreme complexity of 
large, general-purpose associative processors necessi- 
tates the incorporation of fault tolerance features 
into their design. 

It is remarkable, therefore, that the problem of 
fault-tolerance of associative devices has remained 
virtually untouched. Ewing and Davies [3] give tech-
niques for coping with some hardware malfunctions in a
plated-wire implementation of a particular associative 
processor. Proudman [4] suggests that a single error
correcting code can be used in conjunction with mis-
match detectors with a threshold of 2 to detect storage
errors. This paper summarizes the results of a study 
on fault tolerance techniques for associative process- 
ors [5]. We will concern ourselves with hardware 
faults and will assume the programs to be correct repre- 
sentations of intended algorithms for the specified do- 
main of operation. We may note, however, that the 
simplified software of associative processors (e.g. 
fewer loops) with respect to conventional systems, re- 
sults in a proportional simplification in the problem 
of software fault tolerance. 

In the remainder of this paper, we will refer to 
fully parallel, bit-serial, word-serial, and block- 
oriented architectures for associative processors. 

This classification, which is based on the degree of 
parallelism in operations or, alternatively, the amount 
of storage associated with each unit of processing 
logic, is described briefly as follows. A more detailed 
discussion of these concepts and a comprehensive set of 
references can be found in [6]. 

(1) In fully parallel associative processors, proc-
    essing logic is associated with each bit of
    stored data. Most fully parallel systems im-
    plement only the exact-match search operation
    in hardware and use software techniques for
    arithmetic, logic, and more complex searches.

(2) In bit-serial associative processors, process-
    ing logic is associated with each word of
    stored data. All the words can be processed in
    parallel, each in a bit-serial manner.

(3) In word-serial associative processors, a single
    processing unit operates serially on all the
    words. This approach essentially represents
    hardware implementation of a simple program
    loop which is used for linear search.

(4) In block-oriented associative processors, one
    block of information is associated with a unit
    of processing logic. A low-cost implementa-
    tion of such a system may use a head-per-track
    magnetic recording memory in which each block
    is stored on one or more tracks.


FAULT TOLERANCE APPROACH 


Figure 1 shows a model for an associative processor 
which applies to the three classes of fully parallel, 
bit-serial, and block-oriented systems. Since word- 
serial associative processors closely resemble conven- 
tional systems, their fault tolerance problems can be 
studied separately. Each processing element (PE) in
Figure 1 consists of one unit of processing logic and
its associated storage elements. In general, the proc- 
essing elements in the PE array communicate with each 
other and the exact pattern of intercommunication is 
application-dependent. 

A study of fault-induced errors in an associative 
processor shows that they are not easily detectable 
since a single fault may cause an arbitrary number of 
errors. This is evident for faults in global subsys- 
tems of Figure 1, such as the input and mask registers. 
A single fault in one processing element may cause 
errors in others because of PE intercommunication. The 
problem is further compounded by the fact that each PE 
performs logic and selective write operations on indi-
vidual data bits which, as we know, are not easily
checkable.

Figure 1. Associative Processor Model

The associative processor of Figure 1 can be made 
fault tolerant by dividing the PE array into identical 
modules which share spares. Let us assume that we have 
M modules, each consisting of P processing elements. 
It is possible to distribute the decoding and response 
resolution functions among the modules in order to re-
duce the complexity of the non-array portion to a mini-
mum. Figure 2 shows the modules and their interconnec-
tions. One-dimensional intercommunication between mo-
dules has been assumed for simplicity.

Figure 2. Modularized Associative Processor


Given a modular associative device as shown in 
Figure 2, it can be made fault tolerant by the follow- 
ing steps: (1) Incorporating internal failure detection 
ability within each module; (2) Adding S spare modules; 
and (3) Designing switching mechanisms and correspond- 
ing algorithms for reconfiguration. We will assume 
that the M + S operating and spare modules are perma-
nently connected to the main data buses and that spe-
cial isolating circuits exist between each module and
the data buses. Therefore, reconfiguration takes place
by "power switching" and by providing alternate inter- 
communication paths between modules. 


DETECTION OF MODULE FAILURES 


We first discuss the problem of error detection in 
associative processors with respect to the four classes 
mentioned previously. Then we will consider a technique 
which is applicable in all cases. 

A fully parallel associative memory with only 
exact-match search operation and without masking capa- 
bility can be protected against storage errors by using 
a code with a minimum distance of k in conjunction with 
mismatch detectors with a threshold of k. With this 
scheme, stored words containing k-1 or fewer errors will 
never respond to a search operation and are effectively 
isolated from the rest of the system until periodic di- 
agnosis routines detect their failure. The difficulty 
is that such an associative device will have no appli- 
cation besides simple table look-up. For most other 
applications, masking capability, more complex search 
types, and arithmetic operations are essential. 
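
The claim that a stored word with k-1 or fewer errors never responds to an exact-match search can be checked exhaustively for a small code. The sketch below uses a repetition code with minimum distance k = 3 purely as a convenient illustration; the paper does not name a specific code. The three-way classification, in which a word with between 1 and k-1 mismatches is marked "suspect", is one possible reading of a mismatch detector with threshold k and is an assumption of this example.

```python
from itertools import combinations

K = 3  # minimum distance of the illustrative code


def encode(data_bits):
    """Repeat every data bit K times; the resulting code has minimum distance K."""
    return tuple(b for bit in data_bits for b in [bit] * K)


def mismatches(a, b):
    """Number of bit positions in which two words differ."""
    return sum(x != y for x, y in zip(a, b))


def classify(stored, key):
    """Exact match responds; fewer than K mismatches marks the word as suspect."""
    d = mismatches(stored, key)
    if d == 0:
        return "respond"
    return "mismatch" if d >= K else "suspect"


CODEWORDS = [encode((a, b)) for a in (0, 1) for b in (0, 1)]


def corrupt(word, positions):
    """Flip the bits of word at the given positions."""
    return tuple(bit ^ 1 if i in positions else bit for i, bit in enumerate(word))


def ever_responds(word, max_errors):
    """True if some corruption of 1..max_errors bits responds to a codeword search key."""
    for errors in range(1, max_errors + 1):
        for positions in combinations(range(len(word)), errors):
            damaged = corrupt(word, positions)
            if any(classify(damaged, key) == "respond" for key in CODEWORDS):
                return True
    return False


print(any(ever_responds(w, K - 1) for w in CODEWORDS))  # words with up to K-1 errors
print(ever_responds(CODEWORDS[0], K))                   # K errors can reach another codeword
```

With up to K-1 errors a damaged word cannot coincide with any codeword, so it is never selected by a search; once K errors are allowed, a damaged word may become a different valid codeword and respond incorrectly.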

Considerations for bit-serial systems are similar 
to those for fully parallel systems. One advantage 
which exists here is the serial processing of bits in 
each word. This allows us to artificially extend each 
operation to the entire word by performing "null" opera-
tions on bit positions not originally specified. Now,
since all the bits of each word are processed serially, 
codes with low-cost serial encoding and decoding can be 
used to protect against storage errors. It should be 
noted, however, that if operations on small fields with- 
in the words are to be performed frequently, the above 
scheme may result in a significant reduction of speed. 

As noted earlier, because processing is performed 
serially in a word-serial system, protection against 
failures becomes relatively simple. Low-redundancy 
coding can be used to protect against storage errors. 
Failures in the processing logic may be detected through 
self-checking [7] design. Self-checking translators may 
be needed to convert the storage encoding (S-encoding) 
to an encoding suitable for processing (P-encoding). 

The main requirement on the P and S encodings is that 
fast (parallel) translation between the two must be 
possible. 

One favorable property of block-oriented systems 
with respect to fault tolerance is that during each 
operation cycle, a processing element operates on the 
entire block of information assigned to it. This en- 
ables the use of block codes which result in relatively 
low redundancy and have simple serial checking algo- 
rithms. If mechanical storage devices are used to im- 
plement such devices, error bursts become very probable 
due to dust particles, minute scratches, or defects in 
the oxide coating. It has been noted that low-cost 
arithmetic error codes are very effective for coping 
with such burst errors [8]. 

As can be seen from the previous discussion, low- 
redundancy coding techniques are applicable only in spe- 
cial cases. Design of logic circuits in self-checking 
form [7] (i.e., in a way that internal circuit failures
manifest themselves on the circuit's output), particularly
if 1-out-of-2 encoding is used, appears to be promising.
However, because of the relatively higher complexity of 
the self-checking design approach as compared to low- 
redundancy coding techniques, this approach should be 
used when others fail or for protecting the system's 
hard core. A detailed discussion of self-checking de- 
sign concepts is beyond the scope of this paper [5]. 


RECONFIGURATION THROUGH SWITCHING 


For a modular associative device to tolerate module 
failures, the module interconnections should not be ri- 
gid as shown in Figure 2. Rather, the modules should be 
interconnected through specially designed switching cir-
cuits which prevent a system failure as a result of the
failure of a module. The setting of these switching 
mechanisms determines the system configuration and can 
be changed by a central monitor if required. If a 
module error is indicated and the existence of a perma- 
nent failure is determined, reconfiguration procedures 
must be initiated to establish a new working configur- 
ation. In general, data transfers between modules and 
correction of fault-induced errors are needed as part 
of the reconfiguration process.

We will assume only unidirectional (left to right) 
data flow between the modules in Figure 2. The gener- 
alization of the results to bidirectional data exchange 
is straightforward. After detecting the existence of a 
faulty module, the following steps must be taken before 
normal operation can resume: (1) Locating the faulty 
module; (2) Determining a new working configuration; 
(3) Initiating appropriate data transfers; and (4) 
Effecting reconfiguration through switching. The cri- 
teria that should be used in evaluating each reconfi- 
guration scheme include: (1) The amount of data trans- 
fers needed; (2) The complexity of the reconfiguration 
algorithms; (3) The number of spares S needed for toler- 
ating f module failures; and (4) The complexity of 
additional switching circuitry. 
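
The reconfiguration steps and the first evaluation criterion can be sketched for a one-dimensional string of modules. The arrangement is an assumption made for illustration: physical modules are numbered left to right, the S spares sit at the right end, a failed module is simply bypassed, and the first M healthy modules in order form the new string. Data transfers are counted as the number of logical positions whose contents must be held by a different physical module after reconfiguration.

```python
def configuration(total, failed, needed):
    """Physical modules forming the working string, in order, or None if too few remain."""
    healthy = [module for module in range(total) if module not in failed]
    return healthy[:needed] if len(healthy) >= needed else None


def transfers(old, new):
    """Logical positions whose data must be placed in a different physical module."""
    return sum(1 for before, after in zip(old, new) if before != after)


M, S = 4, 1
initial = configuration(M + S, set(), M)       # [0, 1, 2, 3]; module 4 is the spare
after_early = configuration(M + S, {1}, M)     # failure near the left end
after_late = configuration(M + S, {3}, M)      # failure next to the spare
print(after_early, transfers(initial, after_early))
print(after_late, transfers(initial, after_late))
print(configuration(M + S, {1, 3}, M))         # more failures than spares
```

In this arrangement a failure far from the spare forces most of the string to shift, while a failure adjacent to the spare needs a single transfer, so the placement of the spares affects the amount of data movement.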

We first discuss centralized reconfiguration 
schemes in which the switching hardware is external to 
the modules. A straightforward solution is the use of 
a "permutation network" [9] which can interconnect the 
modules in any order. Such a permutation network can 
be implemented as a cellular array [10] of two-state 
basic modules. Since the complexity of such a cellular 
permutation network is roughly proportional to the 
square of the number of modules, its use can be justi- 
fied only if a relatively small number of modules are 
involved. The two-state basic modules can be used in 
a different way to form a "shorting network" [9]. As 
shown in Figure 3, such a shorting network can be used 
to route data around the faulty and spare modules. One 
disadvantage of this scheme, particularly as shown in 
Figure 3, is the excessive amount of data transfers 
needed in the case of a failure. The number of transfers
needed can be reduced by optimal placement of the
spare modules [5].

Figure 3. Reconfiguration with a Shorting Network
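
The influence of spare placement on the number of data transfers can be illustrated with a small sketch. It rests on a simplified model that the text does not spell out: the physical modules form a line with a single spare, the shorting network bypasses the unused (or failed) position, and the i-th logical module always runs on the i-th position that is in use. The function names `transfers` and `worst_case` are hypothetical.

```python
def transfers(n_physical, spare, failed):
    """Count logical modules whose physical host changes after one failure.

    Assumed model: positions 0..n_physical-1 lie on a line, `spare` is the
    single unused position, and data flows left to right, so the order of
    the logical modules must be preserved after reconfiguration.
    """
    before = [p for p in range(n_physical) if p != spare]
    after = [p for p in range(n_physical) if p != failed]  # spare now in use
    return sum(1 for old, new in zip(before, after) if old != new)


def worst_case(n_physical, spare):
    """Largest number of transfers over all single failures of active modules."""
    return max(transfers(n_physical, spare, f)
               for f in range(n_physical) if f != spare)


print(worst_case(5, spare=4))  # spare at the end of the line
print(worst_case(5, spare=2))  # spare in the middle of the line
```

Under this model a spare at the end of a five-position line costs up to four transfers, while a spare in the middle costs at most two, which is the kind of saving that careful placement of spares is meant to achieve.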


Another approach to the reconfiguration problem is 
the use of a distributed switching mechanism; i.e., dis- 
tributing the switching hardware among the modules. This 
can be done by providing each module with a set of input 
and output lines instead of one as shown in Figure 2. 
Then if a successor module connected to one module output
fails, a module connected to another output can act
as its successor. The simplest case, which will be discussed
here, is when each module has two sets of inputs
and two sets of outputs. The two inputs and two outputs 
are distinguished by the letters H and V (horizontal and 
vertical). The module has four states denoted by HH, 
HV, VH, and VV, depending on whether the H or V input is 
used and whether the output is generated on the H or V 
output. 

Figure 4 shows a two-dimensional arrangement of the 
basic modules. It can be seen in Figure 4 that the 9 
modules can be connected into a string. If any single 
module fails, the remaining 8 can continue their opera- 
tion. Double module failures will leave at least 6 
usable modules. Hence, with M=8 and S=1, this scheme
can tolerate all single module failures. With M=6 and
S=3, all double failures can also be tolerated as well
as some triple failures. The problem of optimal inter- 
connection patterns for the tolerance of a maximum num- 
ber of module failures has not been solved. The basic 
advantage of this scheme is that the switching mechanism 
is not part of the system's hard core since a failure in 
the switching circuits is equivalent to a module failure. 
The main disadvantages of this scheme are the complexity 
of the reconfiguration algorithm, excessive data trans- 
fers, and tolerance of fewer than S failures. 


Figure 4. An Example of Distributed Reconfiguration
[panel (c): operation after the failure of module number 2]


A CASE STUDY 


In this section, we illustrate the applicability of 
some of the techniques discussed previously by present- 
ing the design and evaluation of a fault-tolerant asso- 
ciative processor called SPARE (inverse acronym for 
Error-tolerant and Reconfigurable Associative Processor 
with Self-repair). SPARE is essentially a fault-toler- 
ant version of an associative processor which has been 
described previously [3]. Figure 5 shows a block dia- 
gram of the non-redundant system. The random-access 
memory is used for storing instructions and constants 
and consists of 4096 24-bit words. The associative 
memory contains 512 96-bit words. 

The non-redundant associative processor of Figure
5 can be divided into two parts: (1) The associative
(parallel) section, which consists of the associative
memory array, bit column selection logic, and word
logic; (2) The control and sequencing (sequential) section,
which contains all other subsystems of Figure 5. The
sequential section uses status signals and test inputs
for monitoring the operation of the parallel section.
We now briefly discuss the three main features of
SPARE; i.e., error tolerance, reconfigurability, and
self-repair.

To achieve error tolerance, the parallel section
of SPARE is divided into M identical modules. S spare 
modules are shared by the operating modules. Each mo- 
dule has internal failure detection capability which is 
provided by self-checking design of its circuitry using 
two-rail encoding of logic variables. When a module 
error is indicated to the sequential section, the re- 
covery mode is entered and the final result may be the 
replacement of the faulty module by a spare module. 

The sequential section of SPARE resembles a small 
general-purpose computer and can, therefore, be made 
fault tolerant by conventional techniques. 

One of the very important properties of associa- 
tive processors is simple modular growth. The size of 
an associative processor can grow without a need to al- 
ter its algorithms. This suggests that if additional 
processing capability is required, the redundant proc- 
essing logic in SPARE can be utilized. Even the two 
channels of the two-rail circuits can be used indepen- 
dently to double the processing capability if certain 
design criteria are met [5]. Specifically, we postu- 
late the following operation strategy for SPARE: (1) 
During normal operation the system works in redundant 
mode with a number of spare modules; (2) If a module 
failure occurs or additional processing capability is 
needed and if a sufficient number of spares are avail- 
able, they are switched in; (3) If a module failure 
occurs or additional processing capability is needed 
and spare modules are not available, the system reconfigures
into simplex mode by utilizing the two channels
of the two-rail circuits independently.
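
The three-part operation strategy can be written down as a small decision function. This is only a sketch of the rule as stated above; the function name `respond`, its parameters, and the action strings are assumptions made for illustration.

```python
def respond(spares_left, spares_needed=1):
    """Decide how to react to a module failure or a request for more capacity.

    Returns a tuple (action, spares_left, mode). The system is assumed to be
    in redundant mode when the event occurs.
    """
    if spares_left >= spares_needed:
        # Step (2): enough spares are available, so they are switched in.
        return ("switch in spares", spares_left - spares_needed, "redundant")
    # Step (3): no spares left, so the two rails are used independently.
    return ("use the two rails independently", spares_left, "simplex")


print(respond(spares_left=2))
print(respond(spares_left=0))
```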

Of the reconfiguration techniques discussed in the 
previous section, the one using a permutation network 
seems to be suitable for SPARE since only one inter- 
communication line (two in self-checking design) exists 
between modules and the number of modules is expected 
to be small (M=4 or 8, for example). The self-repair 
process will then essentially consist of computing and 
setting of a new state for the permutation network. 
This process must be followed by a recovery procedure 
to transfer the data stored in the failed module to 
the one which replaces it. The permutation network has 
a two-rail self-checking design but no spare is pro- 
vided for it. 


In computing the reliability of SPARE, we will
assume that the coverage factor C includes the reliability
of the permutation network. Using the reliability
modeling technique of Bouricius et al. [11], we find the
reliability improvement factor, defined as
$[1 - R_N(T)] / [1 - R_R(T)]$, as a function of mission time T for several
configurations of SPARE ($R_N$ and $R_R$ denote the non-redundant
and redundant reliabilities, respectively).
Figure 6, which depicts the resulting curves, shows
that for mission times which are short compared to the
MTBF for the non-redundant system, a significant increase
in reliability is possible with a low level of
modularization and a relatively small number of spare
modules.

Figure 6. Reliability Improvement
Factor for SPARE (C=0.99).
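
To give a feel for how such a curve behaves, the following sketch evaluates the ratio of non-redundant to redundant unreliability under a deliberately simplified model. It is not the model used to produce Figure 6. The assumptions are: exponential failures, spares that cannot fail while idle, a sequential section that never fails, and each module failure recovered with probability C. With these assumptions the active modules fail as a Poisson process, and the system survives if at most S failures occur and every one of them is covered.

```python
import math


def r_nonredundant(t):
    """Reliability of the non-redundant system; t is mission time / MTBF."""
    return math.exp(-t)


def r_redundant(t, spares, coverage):
    """Reliability with `spares` idle spares and coverage factor `coverage`.

    Assumed model: at most `spares` failures may occur, and each one must be
    successfully recovered (probability `coverage`).
    """
    terms = ((coverage * t) ** i / math.factorial(i) for i in range(spares + 1))
    return math.exp(-t) * sum(terms)


def improvement_factor(t, spares, coverage=0.99):
    """Ratio of non-redundant to redundant probability of failure."""
    return (1 - r_nonredundant(t)) / (1 - r_redundant(t, spares, coverage))


for t in (0.1, 0.5, 1.0):
    print(t, round(improvement_factor(t, spares=2), 1))
```

Even this crude model reproduces the qualitative trend described above: the improvement is largest when the mission time is a small fraction of the non-redundant MTBF and shrinks as the mission time grows.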


CONCLUSION 


In this paper, we have presented some results of a
study on the fault tolerance of associative processors.
Our main conclusions are as follows:

(1) Dynamic redundancy is to be preferred over the
static approach because associative processors
lend themselves naturally to modularization
and since spares can be shared by a number of
identical modules.

(2) Low-redundancy coding techniques are applicable
for error detection in associative processors
but only in special cases. In particular, the
use of arithmetic error codes for block-oriented
systems appears to be promising.

(3) Application of self-checking circuit design
techniques seems to be an attractive alternative
for error detection in associative devices.

(4) Complex switching mechanisms and algorithms need
to be devised to enable the sharing of spares
by a collection of identical modules which
communicate with each other.


Further research is needed in two equally important
areas. The first area is the design of completely
checked digital circuits. Systematic techniques need
to be developed to aid the designers in choosing suitable
input and output encodings and producing a self-checking
design when presented with a non-redundant circuit
or its functional behavior. The second area deals
with general techniques for reconfiguration in array
processors. Extension and generalization of the results
presented here are possible in two directions. First,
one can conceive of other interconnection schemes for
the case where one-dimensional intercommunication exists
between modules. For example, we may consider a
three-dimensional interconnection pattern in which
there are three choices for each of the left and right
neighbors for a module. Second, one may seek generalizations
to the cases where multi-dimensional module
intercommunication is used. This is a considerably
more complex problem.


ABSTRACT


This paper presents a fault tolerant multiprocessor
architecture suitable for real time control applications
requiring an extremely high degree of reliability.

The architecture satisfies the following requirements:


1) Ability to deal with software as well as hardware
faults: The proposed architecture is based on the
assignment of distinct but redundant software modules
to each task.


2) Efficient use of resources: The proposed architecture
is a multiprocessor using time redundancy for
fault correction. Thus, redundancy (beyond that
needed for fault detection) is invoked only when a fault
is detected. In normal operation, this extra capacity
is available as an additional computing resource.


3) No hard core: In addition to the usual replication
of system components, a partitioned system executive
and a unique communication facility are defined which
ensure that the available redundancy will not be lost
through a "domino" effect.


4) Interaction of computing units with sensors and
effectors: The manner in which system architecture
must be responsive to the amount and type of redundancy
provided by the sensors and effectors is shown.


5) Use of current technology: The proposed architecture
is based on the use of currently available
hardware for the major system components.


After a detailed description of the architecture and
the method of system operation, the system is related
to existing fault tolerant systems, and unique characteristics
of the present design are indicated.


I. INTRODUCTION


There are many commercial and military control
applications for which the computer technology is
currently available, but due to the dire consequences
of failure, computer systems cannot directly be used.
These applications usually involve control of systems
in which human life may be at stake, such as fly-by-wire
aircraft control, or automatic braking of a train
or an automobile. The present paper is concerned
with techniques for achieving a sufficient degree of
reliability to make presently available computers
applicable to such systems.


PROBLEM DEFINITION

In this paper we are concerned with problems of computer
control of real time processes under the following
conditions: 1) Ultra-reliable system operation
(probability of failure approaching zero), 2) Use of
current technology and mainly off-the-shelf components
and subsystems, 3) Realistic cost constraints (i.e.,
limited use of hardware redundancy), 4) Completely
specified task environment: all operations and actions
required by the system are known and can be factored
into the design, 5) Sensor and effector redundancy is
sufficient not to be a limiting factor.


ARCHITECTURE CONCEPTS

The following architectural concepts, used in the
design, are discussed in some detail in the balance
of the paper:


a) Use of multiple and distinct software modules to
detect faults (including design and translation faults).


b) Use of time redundancy to correct detected errors
(time redundancy is efficient in that the redundant computation,
beyond that needed for fault detection, need
be invoked only after an error is detected; resources
can be used productively under normal operation).


c) Integrated consideration of sensors, computer and 
effectors. 


d) No hard core items: distributed and partitioned 
executive control; redundant hardware, software, in- 
formation storage, sensors, effectors, communications, 
power, etc. 


e) No "domino" effect: isolation via hardware restrictions
on communication and control.


UNIQUE ASPECTS OF THE PROPOSED ARCHITECTURE 


We believe that the architecture presented in this 
paper is unique with respect to the following items: 


a) A non-interfering type of broadcast communication 
system. (A discussion of alternative types of inter- 
module communication systems is presented in 
Appendix A.)


b) A system executive which is partitioned into iden- 
tical, independent, autonomous units. (A discussion 
of alternative types of fault tolerant executives is presented
in Appendix B.)


c) The consideration of design and translation errors 
(as well as damage faults) in both hardware and software. 


Fault tolerance is achieved by the use of redundancy; 
however, unless a suitable degree of isolation exists 
between the major system components, it is possible 
that a single failure can destroy a redundant system. 
Items (a) and (b) appear to offer an extremely high 
level of isolation at a low cost in both system and hard- 
ware complexity. The increasing use of LSI circuits 
in the construction of computer hardware, with the 
associated infeasibility of exhaustive testing of such 
complex units, increases the probability that a com- 
puting unit will have undetected design errors. In the 
case of software, it is common knowledge that the 
complexity of such programs also makes their exhaustive
testing impractical. Therefore, it is necessary to
design fault-tolerant computers not only to protect
against future device damage failures; it may also
have to be assumed that both
hardware and software design and translation errors
are present initially. Consideration of this class of
errors appears to be lacking in previously published 
works dealing with fault tolerant architectures. 


RELATION TO OTHER FAULT TOLERANT SYSTEMS 


Even though organized in a unique way, many of the
goals, constraints, and concepts we invoke are common
to other systems. These include (references
are examples; this is not meant to be an exhaustive
listing): a) Off-the-shelf major subsystems, Ref. 9, 2;
b) Lack of hard core, Ref. 9, 7; c) Multiprocessor
organization, Ref. 9, 7; d) Time redundancy, rollback,
Ref. 1, 7; e) Minimal special hardware for
fault detection or correction, Ref. 9; f) Loose synchronization,
no lock-step, Ref. 9, 1; g) Adjustable
degree of fault tolerance (this item
appears to be common to almost all of the FT architectures).


A comparison of the present system with three other
fault tolerant systems is given in Ref. 6.


II. ARCHITECTURAL CONCEPT 


In the following discussion, we will present an architectural
concept rather than a complete system design;
therefore, decisions concerning the "best"
alternative for some of the more detailed structure
will not be made here, but must be determined by 
the specific environment to which the system will be 
applied. The level of detail of the discussion will be 
limited to the following major system components: 


• Processing Units (PU): (mini) computers containing
computational capability, registers, and
scratch-pad memory.


• Memory Units (MU): random access memory
augmented by some minimal logic capability (to
be described later). Each PU has a MU assigned
to it.


• InterModule Communication System (IMCS): the
bussing and special registers used for internal
communication.


• Timing and Synchronization System (TSS): the
conventions by which the system components coordinate
their activities.


• Input/Output Processors (IOP): elementary processors
which interface between the computing
system and the sensors and effectors. Input functions
include A/D conversion, multiplexing, and
buffering. Output functions include voting, buffering,
and D/A conversion.


• Work Schedule and Contingency Plan (WSCP): a
complete schedule for performing the required
tasks, including allocation of resources to the tasks,
and a contingency plan for task performance under
various fault conditions (i.e., loss of resources).


• System Executive Software (EXEC): primarily the
functions of fault detection and system status evaluation,
as well as implementation of the WSCP.


• Sensors (S)

• Effectors (E)

• Application Software (AS)


GENERAL SYSTEM DESCRIPTION


The system consists of two or more (preferably) 
identical PU, each with its own (functionally) iden- 
tical resident EXEC and copy of the WSCP. Each 
PU, MU, and IOP has an output bus (which only it 
can write on) which goes to a dedicated read-only 
register in every other PU, MU, and IOP. The set
of all such dedicated read-only registers in a system 
component will be called its Communication Memory 
(CM). 


In the basic configuration, the computing system
operates in a 4-phase cycle. During Phase 1 (P1),
each PU works on one or more of the tasks assigned
to it by the WSCP. Each task which forms a subset
of the application software is produced in three or
more distinct versions and each version is partitioned
into segments which can be run in the P1 time interval.
(Typically, the execution of such a segment will result
in a control signal to be sent to one of the effectors
in P4 of the cycle.) During P1, exactly two versions
of each task are executed on separate PU, and for
this reason, the simplest arrangement would be an
even number of PU grouped into pairs, with each
pair of PU performing the same set of tasks.

* Some copies of this report are available from the
authors.


During P2, paired PU compare results (partial or 
complete) via the IMCS. If an error is detected (dif- 
ference in results), a HELP request is transmitted 
to the other PU of the system; otherwise, the valid 
data can either be saved for further processing, 
stored in the MU's assigned to the PU's, or stored 
in an IOP for later output to an effector. 


During P3, a PU designated by the WSCP will respond 
to a HELP request by executing additional versions of 
the questionable computation. If no HELP requests 
are outstanding, low output rate, background, high- 
output rate (but not fully protected), self-check, 
housekeeping, etc., jobs are performed as specified 
by the WSCP. 


During P4, HELP request situations are resolved by
majority (or plurality) vote. This decision process
is carried out in every PU of the system by comparing
the data available in its CM. Defective system components
are identified by their lack of agreement with
the majority, and their status is recorded in every PU.
This updated status information, in conjunction with
the WSCP, determines new job assignments and resource
allocations within the system. Other activities
carried out in P4 include acceptance, verification,
and storage of sensor data and transmission of control
signals to the effectors.
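
The four phases can be summarized in a short sketch. It is a minimal illustration under several assumptions: the task versions are ordinary functions, results are compared for exact equality rather than within a tolerance, and all additional versions are run when HELP is requested. The names `plurality` and `run_cycle` are hypothetical.

```python
from collections import Counter


def plurality(values):
    """Return the most frequent value, or None if the top count is tied."""
    ranked = Counter(values).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def run_cycle(versions, data):
    """Run one basic cycle for a single task.

    `versions` holds three or more distinct implementations of the task.
    Returns (result, help_requested).
    """
    # P1: two versions are executed on the paired PUs.
    a, b = versions[0](data), versions[1](data)
    # P2: the paired PUs compare results; agreement means the data is valid.
    if a == b:
        return a, False
    # P3: a HELP request causes additional versions to be executed.
    extra = [version(data) for version in versions[2:]]
    # P4: the disagreement is resolved by majority (or plurality) vote.
    return plurality([a, b] + extra), True


def good(x):
    return 2 * x


def faulty(x):
    return 2 * x + 1


print(run_cycle([good, good, faulty], 3))
print(run_cycle([good, faulty, good], 3))
```

In the first call the paired results agree and the third version is never run; in the second call the disagreement triggers the extra computation and the vote recovers the correct value.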


In the following subsections, we will expand and 
clarify many of the concepts presented in the above 
general system description. 


THE INTERMODULE COMMUNICATION SYSTEM (IMCS) 


The IMCS aids in the performance of four primary 
system functions; these are: (a) Fault detection. It 
provides the means by which the PU can compare 
results. (b) Input/Output Communication. (c) Recon- 
figuration. In the event of failure of one of the phy- 
sical units of the system, the communication patterns 
between the remaining units will typically change; the 
IMCS provides a simple means for accomplishing this 
function. (d) Isolation. The IMCS provides commun- 
ication without allowing a defective unit to damage 
either other functioning units, or the IMCS itself. 


As described earlier, the IMCS is implemented by 
having a (possibly redundant) bus assigned to each 
physical system module which only the assigned 
module can write on. The bus, simultaneously, 
drives a dedicated read-only register in every phy- 
sical system module. The receiving module, based 
on its assigned duties as specified by the WSCP, will 
typically be attentive to only one of the registers in 
its CM at any given time, and ignore the remaining 
registers. Relevant multi-word messages are saved 
by the receiving unit reading and internally storing 
the incoming information; ignored data is overwritten 
and lost. 
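
The register arrangement can be modeled in a few lines. This is a sketch only: the class names `Module` and `IMCS`, the `attending` attribute, and the word-at-a-time `broadcast` call are assumptions introduced for illustration, and redundancy of the busses is ignored.

```python
class Module:
    """A system module with a Communication Memory (CM)."""

    def __init__(self, name, all_names):
        self.name = name
        # One dedicated register per sending module; only that sender's
        # bus can load it, so no module can disturb another's register.
        self.cm = {sender: None for sender in all_names if sender != name}
        self.attending = None  # the sender this module currently listens to
        self.saved = []        # words copied into internal storage


class IMCS:
    """One bus per module, driving a register in every other module."""

    def __init__(self, names):
        self.modules = {n: Module(n, names) for n in names}

    def broadcast(self, sender, word):
        for module in self.modules.values():
            if module.name == sender:
                continue
            module.cm[sender] = word  # the previous word is overwritten
            if module.attending == sender:
                module.saved.append(word)  # relevant data is stored away


net = IMCS(["PU1", "PU2", "MU1"])
net.modules["PU2"].attending = "PU1"
net.broadcast("PU1", 10)
net.broadcast("PU1", 11)
print(net.modules["PU2"].saved)      # both words were kept
print(net.modules["MU1"].cm["PU1"])  # only the latest word survives
```

Because a sender can only reach its own register in each CM, a defective module can at worst fill those registers with bad data, which the receivers are free to ignore.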


In a complete system design, a number of decisions 
concerning the IMCS and involving cost/performance 
trade-offs must be made. These include the width of 
the busses (i.e., serial or parallel data transmission),
the possible use of error detecting or correcting
codes to protect the transmitted information, re- 
dundant registers in a CM (or even redundant CM's 
in each unit), etc. While these decisions have im- 
portant practical significance, they do not signifi- 
cantly alter the architectural design in a logical 
sense. 


THE SYSTEM EXECUTIVE SOFTWARE (EXEC) 


The EXEC, a duplicate copy of which is resident in 
each PU, performs three primary system functions; 
these are: 


(a) Fault detection. This is accomplished in P2 of 
the basic system timing cycle by comparing the outputs
of the P1 application computations produced in
the resident PU with those produced in the "paired"
PU. This comparison is enabled by having each PU
broadcast its P1 outputs over the IMCS during P2.
When the comparison yields out-of-tolerance results 
(tolerance limits are supplied by the WSCP for each 
application module as part of the descriptive informa- 
tion associated with the module), the EXEC signals 
the detection of a fault by broadcasting a HELP mes- 
sage to the system via the IMCS. 


(b) Status determination; a recording, in each PU, 
of the proper or improper functioning of all system 
units. In P3, following the issuance of a HELP 
request, the original computations are repeated 
(original PU's, and software, unless resources are 
reduced to the point that status determination is no 
longer feasible) together with the running of one or 
more additional versions of the software on addi- 
tional PU's. In P4, the fault is resolved by majority 
(or plurality) vote as all PU's issue their results on 
the IMCS. In addition, each PU can also make a 
determination as to the source of the fault (localiza- 
tion to a PU/software module combination). If, in 
the repeated computations of P3, the fault disappears, 
a transient error is recorded against both the re- 
sponsible PU and software module. If the fault is 
repeated, then at the earliest possible later time, as 
specified by the WSCP, the computation is repeated 
with the suspect software (and original data set) run 
in a PU other than the original one; a correct result 
now permits assignment of the fault to the original 
PU, an incorrect result causes assignment of the 
fault to the software module. Status determination 
for MU's and IOP's is also performed and will be 
discussed later. 
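
The localization rule in (b) amounts to a small decision procedure, sketched below. The function name `localize_fault`, its two boolean inputs, and the returned strings are assumptions for illustration; the text does not prescribe a particular encoding.

```python
def localize_fault(fault_repeats_in_p3, correct_on_other_pu=None):
    """Classify a detected fault according to the rule described above.

    fault_repeats_in_p3: did the repeated computation fail again?
    correct_on_other_pu: result of rerunning the suspect software on a
        different PU (None while that rerun has not yet taken place).
    """
    if not fault_repeats_in_p3:
        return "transient: recorded against both the PU and the software module"
    if correct_on_other_pu is None:
        return "unresolved: rerun the suspect software on another PU"
    if correct_on_other_pu:
        return "fault assigned to the original PU"
    return "fault assigned to the software module"


print(localize_fault(False))
print(localize_fault(True))
print(localize_fault(True, correct_on_other_pu=True))
print(localize_fault(True, correct_on_other_pu=False))
```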


(c) Implementation of the WSCP. Each PU has its 
own copy of the WSCP, and record of the status of 
the various system units which is compiled as de- 
scribed above. The WSCP is a complete description 
of the duties to be performed by each system unit, 
keyed to the operational condition (status) of the 
system resources. Thus, without the need for any 
additional coordination between system units, each 
EXEC schedules its portion of the system workload 
on its host PU. 
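
One way to picture a WSCP that is "keyed to the operational condition" is as a table indexed by the set of working units. The sketch below assumes such a table; the task names, the PU names, and the function `jobs_for` are hypothetical, and a real plan would also cover MU's, IOP's, and timing.

```python
# Hypothetical WSCP: maps each set of working PUs to the job assignments.
WSCP = {
    frozenset({"PU1", "PU2", "PU3", "PU4"}): {
        "PU1": ["nav"], "PU2": ["nav"], "PU3": ["brake"], "PU4": ["brake"],
    },
    frozenset({"PU1", "PU2", "PU3"}): {
        "PU1": ["nav", "brake"], "PU2": ["nav"], "PU3": ["brake"],
    },
}


def jobs_for(pu, working_pus, wscp=WSCP):
    """Look up the jobs a PU must run, given the recorded system status."""
    return wscp[frozenset(working_pus)].get(pu, [])


# Every EXEC holds the same plan and the same status record, so each PU
# arrives at a consistent schedule without exchanging any messages.
status = {"PU1", "PU2", "PU3"}  # PU4 has been voted defective
for pu in sorted(status):
    print(pu, jobs_for(pu, status))
```

In this example each task is still assigned to two PUs after the loss of PU4, so fault detection by comparison remains possible.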


The existence of a system EXEC in each PU does not 
rule out the presence of a complementary local 
executive which might be concerned with such func- 
tions as local fault isolation and recovery, interrupt 
processing, service calls, etc. 


THE WORK SCHEDULE AND CONTINGENCY PLAN 
(WSCP) 


As defined previously, the WSCP is a complete de- 
scription of how the workload is to be carried out by 
the system units. Such a plan, of course, must be
custom-tailored to the specific application environment
and its requirements. However, the question
of how to treat units which have demonstrated faulty 
behavior is of a somewhat more general concern. 
The architectural philosophy we are advancing here 
is one which anticipates and is tolerant of occasional 
errors (including those caused by faulty design) in 
both hardware and software. Thus, unless a unit 
produces such a high incidence of failures that the 
timing commitments of the system are threatened, 
no redistribution of workload is required. As a pre- 
cautionary measure, software modules showing 
faulty performance could be reloaded from permanent 
storage. Further, local diagnostic and recovery 
procedures (e.g., invoking redundant hardware 
within a PU) might improve the condition of a damaged 
hardware unit. In general, even when a unit is 
severely damaged, it should still be able to perform 
some functions correctly and thus contribute to verification
tasks. That is, if a damaged unit "A" produces
an answer to a computation which agrees with
unit "B" but disagrees with unit "C", we would be
inclined to accept the answer of "B" as correct.
The architecture presented here is able to accept
such marginal contributions by damaged units.


THE TIMING AND SYNCHRONIZATION SYSTEM (TSS) 


The four phase basic system timing cycle has been 
functionally described in previous sections and we 
will restrict our discussion here to the questions of 
timing requirements and synchronization. 


In a real-time control application, we have the re- 
quirement to provide control signals to the effectors 
at (typically) regular intervals. For full fault toler- 
ant operation, the basic system timing cycle should 
have a duration less than the duration of the cycle 
corresponding to the highest rate control signal. A 
basic system cycle time of 10 ms. or larger appears 
suitable for a wide range of applications. This im- 
plies that hundreds of individual commands can be 
executed (phases 1 and 3) prior to result comparison 
(phases 2 and 4), with buffering of results prior to 
comparison employed as required. Coordination of 
activities in the various system units does not re- 
quire a lock-step type of synchronization, and thus 
we do not require a precision timing system. In 
particular, it appears advisable to adjust the task 
completion time of the tasks to be somewhat less 
than the phase interval (P1 or P4) in which the tasks
are performed. Tasks which are not completed in 
their designated time intervals are assumed to have 
failed. 


The actual timing mechanism could either be a single 
fault-tolerant system clock (Ref. 6), or could be 
accomplished by each PU announcing over the IMCS 
the current system phase according to its own internal 
clock. A majority vote of these timing signals is used 
to determine the actual system phase time and the 
various PU's can be resynchronized accordingly. By 
allocating a small amount of tolerance (or dead) time 
to each phase, slower units should be able to com- 
plete their tasks and resynchronize without trouble. 
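
The second timing option, in which each PU announces the phase and a majority vote decides, can be sketched as follows. The function names and the dictionary of announcements are assumptions for illustration; how the announcements are collected from the CM is not modeled.

```python
from collections import Counter


def system_phase(announced):
    """Return the phase announced by a strict majority of PUs, else None.

    `announced` maps a PU name to the phase (1 to 4) it currently reports.
    """
    phase, count = Counter(announced.values()).most_common(1)[0]
    return phase if count > len(announced) / 2 else None


def out_of_step(announced):
    """List the PUs that must resynchronize to the majority phase."""
    phase = system_phase(announced)
    if phase is None:
        return []
    return sorted(pu for pu, p in announced.items() if p != phase)


reports = {"PU1": 2, "PU2": 2, "PU3": 1}
print(system_phase(reports))  # the majority view of the current phase
print(out_of_step(reports))   # units that lag or lead the majority
```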


During startup, or resynchronization after a cata- 
strophic failure, each unit follows the procedures 
specified by the WSCP and announces its current 
perception of the system state over the IMCS, just 
as it would for any other task. 


For some applications, outputs might be required at 
a rate faster than feasible for the basic timing cycle. 
In such a case, two alternatives are possible. Three 
(rather than the normal two) versions of the task can
be executed in parallel and conventional majority
logic used to determine the output. Such special 
handling of a task would not interfere or conflict 
with normal system operation. The second alter- 
native is to transmit the control signals at the re- 
quired rate with no assurance of immediate correct- 
ness, but the existence of faulty output would be de- 
tected, and corrective action initiated within one 
basic system cycle time. 


THE PROCESSING UNITS (PU) 


These devices are assumed to be conventional, off- 
the-shelf (mini) computers, or even single chip com- 
puters, with the possibility of a few minor additions. 
In particular, each machine requires a CM whose 
registers can be individually loaded under external 
control. A hardware (multiargument) voting opera- 
tion would be very desirable, though certainly not 
essential. Each PU has a small amount of internal 
working storage capacity (scratch pad memory). 


THE MEMORY UNIT (MU) 


The MU's are random access storage devices which 
have the primary responsibility for verifying and 
maintaining the application data bases; that is, the 
sensor data and precomputed information needed to 
calculate the new effector control signals. Thus, if
a disagreement is detected in the computed outputs of
paired PU's, and a new PU is assigned the task of 
resolving this disagreement, the new PU can access 
one of the relevant MU's to obtain the data appropriate 
to its computational task. 


Like all other major system units, the MU has a CM 
and some simple logic which includes the ability to 
compare and vote on multiple data items. During 
normal operation, a MU is assigned to a single PU 
and obeys its storage (only during P2 and P4) and 
retrieval commands (at any time) by paying attention 
to the appropriate register in its CM. It also intercepts,
votes on, and stores data from sensors relevant
to the tasks assigned to its associated PU. If
the sensor data is not consistent, it stores the majority
opinion and tallies the disagreement for later
fault reporting. Before storing data from its own PU
(i.e., permanently altering the data base), it compares
this data with that produced by the paired PU;
if there is a disagreement, no storage takes place 
until after the resolution of the resulting HELP re- 
quest. During the HELP procedure, it responds to 
information requests from other PU's assigned to 
the failed task. In P4, following the help request, it 
determines and stores the majority opinion. 
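
The storage rules of a MU can be sketched as a small class. It assumes exact comparison of values and a strict majority among the sensor readings; the class name `MemoryUnit`, its method names, and the choice to store nothing when no majority exists are illustrative assumptions, not part of the described design.

```python
from collections import Counter


class MemoryUnit:
    """Verifies data before it is allowed to alter the data base."""

    def __init__(self):
        self.data = {}
        self.disagreements = Counter()  # tally kept for later fault reporting

    def store_sensor(self, key, readings):
        """Vote on redundant sensor readings and store the majority value.

        `readings` maps an input path name to the value it delivered.
        """
        value, count = Counter(readings.values()).most_common(1)[0]
        if count <= len(readings) / 2:
            return False  # assumption: without a majority nothing is stored
        self.data[key] = value
        for path, reading in readings.items():
            if reading != value:
                self.disagreements[path] += 1
        return True

    def store_result(self, key, own, paired):
        """Store a PU result only if the paired PU produced the same data."""
        if own != paired:
            return False  # storage waits for the HELP request to be resolved
        self.data[key] = own
        return True


mu = MemoryUnit()
mu.store_sensor("speed", {"ip1": 50, "ip2": 50, "ip3": 49})
print(mu.data, dict(mu.disagreements))
print(mu.store_result("command", 7, 8))
```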


A MU can be reassigned to another PU if its own PU 
is judged to be inoperative by a majority vote of the
PU's in the system. It can also be used in a shared
mode (i.e., it now services more than one PU) when 
so instructed by either its own PU, or by a majority 
vote of the system PU's. 


We note that the data base associated with each task 
is normally stored (in verified form) in two MU's. 
However, to be able to resolve a disagreement in the 
stored data should one of the MU's suffer a failure, 
we require that error detecting encoding be employed 
by the MU's. 


THE INPUT/OUTPUT PROCESSORS (IOP) 


Input functions such as signal sensing, A/D convers- 
ion, multiplexing, and buffering are usually required 
in interfacing a computing system to the real world. 
In a fault tolerant system, these functions must
satisfy the same criteria with respect to redundancy
and isolation that we require of our computing com- 
ponents. Thus, we would expect that critical sensing 
devices be at least triply redundant, and their signals 
reach the computer through at least three independent 
processing (A/D conversion, multiplexing, and buffer- 
ing) paths. We will call each such path an Input Pro- 
cessor, and assign it a register in all CM's of the 
system. 


The normal output functions of a real time control 
system, including D/A conversion, demultiplexing, 
and buffering, as well as final effector activity, must 
again satisfy the fault tolerant criteria that are required
of the rest of the system. In the case of the effectors, 
this consideration can be critical, since a single (non- 
redundant) effector performing a vital function can 
render useless all the prior redundancy built into the 
system. Thus, for each task which involves effector 
activity, we require at least triple redundancy. Some 
effectors can provide this capability in a single device 
by internally performing a voting operation (e.g., an 
actuator with multiple inputs which performs the voting 
hydraulically); however, even this capability is not 
satisfactory due to the lack of isolation in achieving 
the redundancy (i.e., this single device is still a 
hardcore item). 


Given that we have at least three independent signal 
paths to three independent effectors for each output 
task, we now must drive each of these paths with the 
control signals produced by two or more independent 
PU's. An Output Processor is defined as a device 
which includes the signal path terminating in an effec- 
tor, and which obtains its output information based on 
a vote of the relevant information appearing in its CM 
(i.e., either agreement of two data items in P2, or 
plurality of three or more items in P4). 
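
The voting rule of the Output Processor can be illustrated with a minimal sketch. The function name `output_vote` and the treatment of a tie as "unresolved" are assumptions made for illustration; the paper specifies only agreement of two items in P2 and plurality of three or more items in P4.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence

def output_vote(items: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value an Output Processor would act on, or None if unresolved."""
    if len(items) < 2:
        return None                      # a single opinion is never acted on
    if len(items) == 2:                  # P2 case: the two data items must agree
        return items[0] if items[0] == items[1] else None
    counts = Counter(items).most_common()
    # P4 case: plurality of three or more items; a tie for first place
    # is treated here as unresolved (an assumption of this sketch).
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]
```

With two agreeing items the value is passed to the effector path; a disagreement yields no output until additional opinions are available.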


THE APPLICATION SOFTWARE (AS) 


The proposed architecture imposes a number of con- 
straints and requirements on the AS; these include: 


(a) The requirement for three or more distinct versions 
of the software for each application task (to 
help detect design and translation errors, as well as 
damage faults, in both hardware and software). While 
considerable attention has been devoted to determining 
whether two logical constructs are identical, the 
question of criteria for establishing degrees of dis- 
tinctiveness does not appear to have been previously 
considered. For most cases of practical interest, we 
might assume that modules programmed by different 
programmers would satisfy the distinctness require- 
ments. The topic of distinct software is discussed 
further below. 


(b) The requirement for a common data base; that is, 
all software versions of each specific task must be 
able to use information stored in a common data base. 
This programming convention is necessary to allow a 
new version of a software module to be called into 
execution to resolve a conflict, without having to 
either maintain or generate a separate data base for 
such a standby software module. In a control application 
environment, where the data base would typically 
contain previously recorded sensor values and com- 
puted system states, a standard method of storing 
such data seems quite reasonable. 


(c) The requirement for partitioning the software 
into segments which can be executed within the P1 
time interval, and which will produce an (intermediate 
or final) output suitable for comparison and fault detection. 
The partitioning requirement is compatible 
with the need for control signals at short periodic 
intervals typical of the control environment. Even in 
those cases where the computation must extend over 
many basic system cycles, and no reasonable inter- 
mediate results can be produced prior to completion 
of the computation, normal error protection can still 
be provided if the computation time interval (in system 
cycles) is a suitably small fraction of the required out- 
put cycle time. In this case the verifying computation 
will not be completed in a single P4 interval but will 
extend over a series of P4 portions of successive 
system cycles. 


Distinct software. As noted above, an important 
aspect of the present fault tolerant design is the avail- 
ability of distinct software, i.e., processors per- 
forming the same task must each have a program 
which satisfies the same specifications, but each program 
must use procedures that are "different" in some 
sense. Since identical processors are to be used in 
the system, the motivation for such distinct programs 
is that if identical processors are subject to the same 
fault situation, the programs (and hence the proces- 
sors) will be in a different state when the fault occurs. 


In converting from specification to procedures, dis- 
tinct programs can be obtained by utilizing: 


1) Different theoretical methods of converting specifications, 
e.g., using different methods for mechanizing 
the z-transform, or different methods of mechanizing 
trigonometric functions. 


2) Different procedural methods, obtained either by 
using two or more persons writing programs based 
on the same specifications, or by the conscious inter- 
change of independent software procedures. 
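
As an illustration of item 1, the following sketch mechanizes the sine function by two different theoretical methods and compares the results within a tolerance. The two methods chosen (a truncated Taylor series and the triple-angle identity), the function names, and the tolerance are assumptions for illustration only; the paper does not prescribe particular algorithms.

```python
import math

def sin_taylor(x: float, terms: int = 12) -> float:
    """Version 1: truncated Taylor series after reducing x to [-pi, pi]."""
    x = math.remainder(x, 2.0 * math.pi)
    term, total = x, x
    for k in range(1, terms):
        term *= -x * x / ((2 * k) * (2 * k + 1))
        total += term
    return total

def sin_triple_angle(x: float) -> float:
    """Version 2: recursive identity sin x = 3s - 4s^3, where s = sin(x/3)."""
    if abs(x) < 1e-3:
        return x - x ** 3 / 6.0          # small-angle approximation at the base
    s = sin_triple_angle(x / 3.0)
    return 3.0 * s - 4.0 * s ** 3

def agree(a: float, b: float, tol: float = 1e-6) -> bool:
    """Comparison used for fault detection; tol is a hypothetical tolerance."""
    return abs(a - b) <= tol
```

Because the two versions pass through different intermediate states, a fault that disturbs one computation is unlikely to disturb the other in the same way, and the comparison exposes the disagreement.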

II. SUMMARY AND CONCLUSIONS 


In this paper, we have presented an architecture 
which satisfies the following requirements for real 
time control applications: 


1) Ability to deal with software as well as hardware 
faults: The proposed architecture is based on the 
assignment of distinct but redundant software modules 
to each task. We have shown how communication, 
synchronization, and resource allocation can be hand- 
led at the system level to deal with the problems 
arising from such an approach. 


2) Efficient use of resources: The proposed archi- 
tecture is a multiprocessor using time redundancy 
for fault correction. Thus, redundancy (beyond the 
minimal requirement of duplicate computation needed 
for fault detection) is invoked only when a fault is 
detected. In normal operation, this extra capacity is 
available as an additional computing resource. 


3) No hard core: In addition to the usual replication 
of system components, we have defined a distributed 
and partitioned system executive and a unique communication 
facility which ensures that the available 
redundancy will not be lost through a "domino" effect. 
In particular, we have addressed and posed a solution 
to the question of a defective unit "locking-up" the 
communication channels, and bringing down the entire 
system. 


4) Interaction of computing units with sensors and 
effectors: We have discussed how system architecture 
must be responsive to the amount and type of redun- 
dancy provided by the sensors and effectors. 


5) Use of current technology: The proposed architecture 
is based on the use of currently available 
hardware for the major system components. For 
example, the processing units will typically be con- 
ventional minicomputers or even single chip computers. 


Since we have been primarily concerned with the logical 
organization of a fault tolerant architecture at a 
systems level, there are many questions we have not 
addressed. Thus, for example, the details of making 
the individual PU's internally fault tolerant have not 
been considered. We would assume that redundant 
power supplies are used, but have not discussed this 
point. Physical separation of the units and separation 
of the communication busses seem desirable, but were 
not discussed. 


We raised the question of distinct but redundant appli- 
cation software modules with requirements for a 
standard data base, and rough synchronization of 
computation. While achieving these requirements for 
a particular application seems relatively straight- 
forward, our continuing efforts will be directed to 
formalizing this process. Finally, we feel that the 
proposed architecture is suitable for a general problem 
environment as well as real time control applications; 
we plan to extend the simple supervisor presented 
here to permit extension to the general multiprocessor 
domain. 


APPENDIX A 
TECHNIQUES FOR INTERMODULE COMMUNICATION 


Given a set of computing, memory, and I/O modules 
which are to contribute to the processing tasks of a 
system, there are two major factors which affect 
the communication between these modules: 


1 - Connection characteristics, the nature and topology 
of the paths between the modules. 


2 - Control mechanisms and communications protocols, 
the method of controlling the communication be- 
tween modules and the nature of the communication. 


Various types of topology and control commonly used 
are indicated in Table A-1, and examples of some of 
the communications systems used in fault tolerant 
designs are tabulated in Table A-2. 

TABLE A-1 


Types of Communication Interconnections, Control 
Mechanisms, and Protocols 


1 - Connection characteristics 


a - Topology 
Each module connected to a central module 
Each module connected to n other modules 
Each module connected so as to form a ring 
(Redundant links may be used in each of these 
topologies.) 


b - Nature of connection 
Direct wire, module to module 
Single or multiple common bus (uni- or 
bi-directional) 


2 - Control mechanisms and protocols 


Permission to send/receive given by central 
unit, or according to standard protocol (e.g., as 
in a conventional computer multiplexer bus). 


Each module has specified time slot to send/receive. 


"Lazy Susan", messages inserted into communication 
stream when empty slot appears. 


Random transmit/receive. 


TABLE A-2 


Examples of Different Communication Approaches 


ARPA Net (Ref. 8) 

Topology: Each station communicates with two or three 
other stations. 

Controls and Protocols: No central control. Lazy Susan 
variation; packet of information addressed to destination 
with no specific routing indicated. Packet is forwarded 
from node to node until destination is reached. 


UC Irvine Ring (Ref. 5) 

Topology: Ring. 

Controls and Protocols: No central control. Lazy Susan 
arrangement with each processor examining the data 
stream as it goes by. Addressing is by process rather 
than destination. 


SIFT, Stanford Research Institute (Ref. 9) 

Topology: Multiple common bus to each module; single 
dedicated line between each processor and its memory. 

Controls and Protocols: Processors can read from any 
memory via common bus; processor can only write into 
its own memory. 


STAR, JPL (Ref. 1) 

Topology: Common busses. 

Controls and Protocols: Central control by TARP. 


TRW (Ref. 3) 

Topology: Each CPU and associated memory are connected 
to an input/output control unit (IOCU). Each IOCU is 
connected to a set of multiple data busses. External 
devices communicate with the computer via these busses. 

Controls and Protocols: Central system control unit 
monitors and controls communication. 


Autonetics (Ref. 2) 

Topology: Four CPU's on multiple common data bus. 

Controls and Protocols: VCS control to external output. 


Proposed System 

Topology: Multiple common busses, one bus originating 
at each module, and leading to read-only registers in 
all other modules. 

Controls and Protocols: Random transmit capability by 
each module on its own bus. Random receive/read 
capability by each unit with respect to its read-only 
registers which receive information from the multiple 
busses. 
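
The communication scheme of the proposed system can be sketched as follows. The class `Module`, the function `transmit`, and the module names are hypothetical and serve only to illustrate the idea that each module drives one bus of its own, terminating in a dedicated register in every other module.

```python
class Module:
    """A module holding one receive register per other module (illustrative only)."""
    def __init__(self, name, all_names):
        self.name = name
        # One register for each bus arriving from another module.
        self.registers = {n: None for n in all_names if n != name}

def transmit(sender, modules, value):
    """The sender drives only its own bus, which ends in one register in every other module."""
    for m in modules.values():
        if m.name != sender:
            m.registers[sender] = value

names = ["PU1", "PU2", "PU3"]
modules = {n: Module(n, names) for n in names}
transmit("PU1", modules, 0x2A)
transmit("PU2", modules, 0x2A)
transmit("PU3", modules, 0xFF)   # a babbling module can overwrite only its own slot
```

Since no bus is shared between senders, a defective module cannot block the traffic of the others; each receiver simply reads whichever of its registers it chooses.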


APPENDIX B 
TYPES OF SYSTEM EXECUTIVES 


The executive of a fault tolerant system receives 
reports of system failure, controls the procedures 
that are to be followed when such failures occur, and 
re-allocates resources as required. The executive 
can take one of the following forms: 


1 - Central executive 


A central executive is one in which a single 
monolithic structure controls the overall system 
from a single PU. The two mechanizations possible 
are: 


Hardware executive, e.g., a voter/comparator/ 
switch (VCS) unit, whose logic design determines 
the procedures to be followed. This type of executive 
is usually limited to controlling system output based 
on majority vote of identical computations, switching 
out defective modules, and switching in spares. (Ref. 2) 


Software executive. An executive mechanized in soft- 
ware is usually designed to handle a more complex 
set of situations. The software executive often deals 
with failures by changing the task allocation tables, 
so that defective modules are lightly loaded or ignored 
altogether. For reliability purposes, duplicate copies 
of the executive may be available in auxiliary storage, 
or even used to monitor the performance of the con- 
trolling executive. (Ref. 3) 


2 - Distributed executive 


In the distributed executive, an attempt is made 
to spread the executive functions for both efficient 
operation and so that in case of damage to one part 
of the system, executive capability will still be 
available. Two types of distributed executive are 
possible: 


Multiprocessed executive. The tasks of the executive 
are carried out by more than one computer, and voting 
is then used on the multiple outputs. An example 
of this approach is the SIFT, Ref. 4. 


Partitioned executive. The present design uses a 
partitioned executive in which each part of the system 
operates autonomously, based on observations of how 
the rest of the system is performing. 


Discussion 


In most of the multiprocessor approaches, if the 
executive encounters a design error which causes it 
to fail, the entire system can fail. For example, if 
a particular combination of tasks causes a "lock-up" 
condition, then the system will not be able to proceed 
past that state. In the case of the partitioned executive, 
however, each module is operating in its own task 
sequence under its own executive. If an executive 
failure should occur, only that processor is affected 
by the failure. 




A VARISTRUCTURED FAIL-SOFT 
CELLULAR COMPUTER 


G. J. Lipovski 
University of Florida 


ABSTRACT 


The architecture of a von-Neumann class computer 
is considered, in which the user programmer can request, 
at the beginning of his task, one of many word widths, 
and one of many memory heights. Several users are able 
to space share the computer. We call this feature 
varistructure. The computer is a minimally, yet 
strongly connected cellular structure consisting of 
microcomputers, and has the capability of being fail- 
soft. 


1. INTRODUCTION 


There is a growing desire for fail-soft computers. 
Especially where the computer performs an essential or 
very important function, it would be very desirable 
that the whole system need not stop working where one 
part fails. There is also a strong desire for a 
computer that is made of at most a few basic modules 
which are connected in a regular way, that is, a 
cellular computer, especially if the number of con- 
nections is minimal. If these objectives can be 
obtained for a computer that looks like a conventional 
von Neumann class computer to programmers, that computer 
should be quite effective! We will show a computer 
having these characteristics. 


In this paper, we will consider the techniques used 
to support varistructure. In the next section we look 
at the cell interconnections. We consider the nature 
of the interconnections, the topology. Then we 
interpret this in terms of data transmission paths. In 
section 3, we examine the cell. We consider the con- 
struction of a suitable cell for our examples, general 
operation of an instruction cycle, and the meaning of 
STRUCTURE states. In section 4, we consider the 
operation of the cellular machine for the memorize or 
recall cycles. We describe the variable structure 
concept and the mechanism to select the height and 
width of memory. In section 5, we show how the execute 
cycle can be done. In particular, we consider the carry 
link operation in this machine. Section 6 shows the 
fetch cycle. Section 7 shows how the structure can be 
set up and section 8 gives our conclusions concerning 
this machine. 


In order to explain the techniques used in this 
processor, we will arbitrarily choose an eight bit wide 
CPU and memory configuration. We will assume the 
address is sent on a separate link. We will also assume 
a standard one accumulator CPU structure. None of these 
assumptions are necessary for the architecture. In 
particular, there are many ways to time-share the links 
to decrease the number of pins, and so on. We would 
not propose building the machine in the way we describe 
it. However, it is expedient to simply describe an 
example of this architecture so that the techniques are 
clearer. We choose the simplest example. 


2. CELL INTERCONNECTIONS 


2.1 Topology 

The study of the interconnection of cells, the 
topology, is the key to minimally connected fail-soft 
computers. We propose to seek out the class of all 
structures with the property that, for a fixed number 
of nodes, there is a minimum number of (bidirectional) 
links such that the graph is strongly connected, and 
if any link is deleted, it is no longer strongly connected. 
All such graphs are trees! As we noted earlier 
(1), and as T. C. Chen also observed (2), the tree structure 
is fail-soft: if any node is faulty, say 
an output driver is stuck-on-one, then the subtree can 
be pruned from the remainder of the tree, and the remainder 
can continue operating at reduced capacity. 
Since, in a homogeneous $\ell$-level tree with fixed fanout 
$f$, there are $f^{\ell}$ leaf nodes, $f^{\ell-1}$ nodes on the next 
higher level, and so on, almost all nodes are leaf 
nodes. For example, in a binary tree, half are leaf 
nodes, and in a ternary tree about 65% are leaf nodes. 
So a failure in most of the tree will cause small loss 
of performance because only a small subtree containing 
the failure will be extracted. 
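
The proportion of leaf nodes can be checked numerically with a short sketch. The function name `leaf_fraction` is hypothetical, and the tree is assumed to be full, with one root and `levels` further levels below it; under that assumption the fraction approaches (f - 1)/f as the tree grows.

```python
def leaf_fraction(fanout: int, levels: int) -> float:
    """Fraction of nodes that are leaves in a full tree with the given fanout.

    The root is level 0 and the leaves are at level `levels` (an assumed convention).
    """
    leaves = fanout ** levels
    total = sum(fanout ** i for i in range(levels + 1))
    return leaves / total
```

For example, `leaf_fraction(2, 3)` counts 8 leaves among 15 nodes, and for fanout 3 the value settles near two thirds as the number of levels increases.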


We also observe that a tree with fanout f having 
n nodes has delay approximately proportional to $\log_f n$. 
It is also possible to put the tree structure in a 
physical space in which the total delay is proportional 
to $a\sqrt{n} + b \log_f n$ (3). Although there are structures 
that have lower delay, they also have more connections 
through which it is difficult to stop a stuck-on-one 
fault. 


2.2 Broadcast Domains 


The tree links consist of a data link, say, L[0~7], 
an address link A[0~15], a control link, say, K[0,1], and 
priority/carry lookahead circuitry, to be used for normal 
operation. (Four more links will be introduced later.) 
Links A, L and K are bidirectional amplifiers which 
are independently opened or closed electronically (4). 

A subgraph of the tree in which L is connected is 
called a data domain, for which A is connected, an 
address domain, and for which K is connected, a control 
domain. These three domains are broadcast domains. 
During an event of the process, one or more cells will 
broadcast into a data domain in L, the data will be 
wire-OR'ed, and transferred to all cells in the data 
domain in the same event. The control domain in K and 
address domain in A work the same way. 
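
A broadcast event in a data domain can be sketched as a wire-OR over the bytes driven by the broadcasting cells. The function name `broadcast` and the use of dictionaries to represent cells are assumptions made for illustration.

```python
from functools import reduce

def broadcast(domain_cells, drivers):
    """Wire-OR the bytes driven onto L and deliver the result to every cell in the domain.

    domain_cells: names of the cells in one connected data domain.
    drivers: mapping from a cell name to the 8-bit value it drives onto L.
    """
    level = reduce(lambda a, b: a | b, (v & 0xFF for v in drivers.values()), 0)
    return {cell: level for cell in domain_cells}
```

The sketch also shows why a stuck-on-one output is the troublesome fault in such a structure: a cell driving all ones masks whatever the other cells in the same domain broadcast.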


3. THE CELL 
3.1 Construction 


The cells consist of a microcomputer CPU and some 
random access memory (see Figure 1). For simplicity, 
we will assume that the random access memory is, say, 
a 1K x 8 bit page with a 16 bit address. The high order 
6 bits of the address of the words on this page will be 
the same, and will be called the page number. The page 
number will be stored in a register PAGE[0~5] in the 
memory decoder. When an address A[0~15] is presented 
in an address domain, the memory with PAGE[0~5] equal 
to address A[0~5] will read the word on that page chosen 
by bits A[6~15] onto the link L, or write data from the 
link L into that word. 
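
The address decoding just described can be sketched as follows, taking bit 0 as the high-order bit so that A[0~5] is the page number and A[6~15] selects one of 1K words. The function names are hypothetical.

```python
from typing import Tuple

def split_address(a: int) -> Tuple[int, int]:
    """Split A[0~15] into (page number A[0~5], word index A[6~15])."""
    a &= 0xFFFF
    return a >> 10, a & 0x3FF

def page_selected(page_register: int, a: int) -> bool:
    """True if a memory whose PAGE register holds page_register responds to address a."""
    return split_address(a)[0] == (page_register & 0x3F)
```

For instance, the address 0x3401 decodes to page 13, word 1, so only a memory with PAGE equal to 13 takes part in that cycle.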


The microcomputer will have an arithmetic-logic 
unit, a microprogram store, an instruction register 
I[0~7], and a suitable collection of registers for 
programming. For simplicity, we will assume that these 
are a temp register TEMP[0~7], an accumulator ACC[0~7], 
a program counter PC[0~15] and one index register 
X[0~15]. Even though more registers will be required 
in a practical machine, these registers will be 
suitable to demonstrate the technique of varistructure. 


Figure 1. A Cell (block diagram: links L, K, A; a CPU 
containing the microprogram store and arithmetic logic 
unit; memory) 


3.2 General Operation 


Operation of the microcomputer will be similar to 
that of the von Neumann computer. An instruction will 
consist of a sequence of microinstruction cycles 
including a fetch cycle and a memorize cycle, recall 
cycle, or execute cycle. For example, a typical ADD 
instruction would consist of: 1) a fetch cycle, where 
PC is sent out as an address on link A, the returning 
data on L being stored in the instruction register I 
and decoded; 2) a recall cycle in which the word 
pointed to by index register X is put in TEMP; and 3) 
an execute cycle, in which TEMP and ACC are added, the 
result being left in ACC. The control link K is used 
to set up broadcast domains for each cycle. For a fetch 
cycle, K is 00, for an execute cycle, 01, for a memorize 
cycle, 10, and for a recall cycle, 11. 
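
The cycle codes carried on K, and the cycle sequence of the ADD example, can be written down in a short sketch. The names `Cycle`, `ADD_SEQUENCE` and `k_codes` are introduced here for illustration only.

```python
from enum import IntEnum

class Cycle(IntEnum):
    """Values placed on the control link K[0,1] for each cycle type."""
    FETCH = 0b00
    EXECUTE = 0b01
    MEMORIZE = 0b10
    RECALL = 0b11

# The typical ADD instruction: fetch, recall the operand into TEMP, then execute.
ADD_SEQUENCE = [Cycle.FETCH, Cycle.RECALL, Cycle.EXECUTE]

def k_codes(sequence):
    """Two-bit K patterns broadcast for a sequence of cycles."""
    return [format(int(c), "02b") for c in sequence]
```

For the ADD sequence this gives the K patterns 00, 11 and 01 in turn.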


3.3 Operation of STRUCTURE 


Finally, each cell will have a structure state 
STRUCTURE[0,1]. During each cycle, STRUCTURE will 
determine the limit to which the address and data 
domains extend and the behavior of the CPU. (The 
memory behaves the same for all states.) We will 
discuss the four values of this variable as we consider 
the operation of the processor. The programmer 
determines his configuration by loading the value of 
STRUCTURE in each cell before the program and data are 
loaded in. In general, when STRUCTURE is 00, the CPU 
in the cell will become "passive" and the K, A and L 
links above (towards the root from) this cell will be closed 
switches; when STRUCTURE is 01, the CPU will be a "byte 
slice". It will behave as a one byte slice of a 
parallel ALU other than the left slice (containing the 
sign bit), the microinstruction decoder will be ac- 
tivated, and L will be open while K and A are closed 
above the cell. When STRUCTURE is 10, the CPU will be 
a "left terminal". It will behave as the leftmost one 
byte slice of a parallel ALU. Only one CPU of a group 
will determine branching and so on. The left slice CPU 
will send out control lines to memory. The micro- 
instruction decoder will be activated, to decode an 
instruction, and L, K and A are open above this cell. 
Left terminal and byte slice CPU's will be called 
"active" CPU's. (Another mode can be added to handle 
vectors, but we will not consider this simple extension.) 
We will discuss the operation of the recall/ 
memorize cycle in the next section, which will show 
how data can be brought into the ALU in such a way that 
it can be treated as variable width data. We show how 
this is done in the following section, when the execute 
cycle operation is shown. 
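
The three STRUCTURE values described above can be tabulated in a small sketch. A closed switch is represented as "connected above" and an open switch as "not connected above"; the names `Mode`, `STRUCTURE_MODES` and `is_domain_root` are hypothetical, and the table reflects only the general description given here (the fetch cycle treats the L link differently).

```python
from typing import NamedTuple

class Mode(NamedTuple):
    role: str
    active: bool
    l_connected_above: bool
    k_connected_above: bool
    a_connected_above: bool

STRUCTURE_MODES = {
    0b00: Mode("passive", False, True, True, True),
    0b01: Mode("byte slice", True, False, True, True),
    0b10: Mode("left terminal", True, False, False, False),
}

def is_domain_root(structure: int, link: str) -> bool:
    """True if the cell is the topmost cell of the broadcast domain on link 'L', 'K' or 'A'."""
    mode = STRUCTURE_MODES[structure]
    return not getattr(mode, f"{link.lower()}_connected_above")
```

A byte slice cell thus tops a data domain but not an address domain, while a left terminal cell tops all three domains.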


4. MEMORIZE AND RECALL CYCLE OPERATION 
4.1 Variable Structured Data 


We will now consider a technique whereby the pro- 
grammer can select the width and height of his random 
access memory. We will consider a uniform tree with 
fanout 3, although other configurations are obviously 
possible. With this configuration, the programmer can 
select heights of 1, 4, 13 or any number $\sum_{i=0}^{n} 3^{i}$ K 
words, and widths of 1, 3, 9 or any number $3^{n}$ bytes per 
word. Of course, with a fanout of $f$, the height can be 
$\sum_{i=0}^{n} f^{i}$ K and the width can be $f^{n}$. 
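
The selectable heights and widths follow directly from the two expressions above, as this short sketch shows. The function names are hypothetical.

```python
def heights_in_k(fanout: int, max_n: int):
    """Memory heights in K words: the sum of fanout**i for i = 0..n, for each n up to max_n."""
    return [sum(fanout ** i for i in range(n + 1)) for n in range(max_n + 1)]

def widths_in_bytes(fanout: int, max_n: int):
    """Word widths in bytes: fanout**n for each n up to max_n."""
    return [fanout ** n for n in range(max_n + 1)]
```

For a fanout of 3 these reproduce the heights 1, 4, 13 and the widths 1, 3, 9 quoted in the text.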


4.2 Selection of Memory Height 


Starting at the leaves of a tree, all cells can 
have STRUCTURE equal to 00 so that the CPU will be 
passive and the memory will be made available to the 
data and address links. Suppose there is a cell that is 
n levels (say, 2) away from the leaves that has STRUCTURE 
not equal to 00, all those below it having STRUCTURE 
equal to 00 (see Figure 2). 


Figure 2. Height selection of byte slice (diagram labels: 
topmost cell with STRUCTURE 10 or 01, CPU active; cells 
below it with STRUCTURE 00, CPUs passive; n = 2 levels 
from leaves; page numbers beside the nodes) 


The A, L and K links between the cells in this subtree 
are connected together, and only the topmost CPU is 
active, the others being passive. Since, in the recall 
cycle, any cell can read a word into the link, and from 
there to the active CPU, we can assign different page 
numbers to different cells. They do not need to be in 
any order (see Figure 2). The numbers next to the nodes 
are page numbers. On the recall cycle, an address is 
sent by the active CPU to all cells on the connected A 
link in the address domain, and one of them will match 
the high order bits of the address with its page number 
PAGE. It will send a word on the connected L link in 
the data domain. The active CPU will load this data 
into its TEMP register. The memorize cycle will, of 
course, be similar. 
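
A recall cycle within one data domain can be sketched as follows. Representing each cell as a (page number, memory) pair, and the names `make_cell` and `recall`, are assumptions for illustration; the page numbers are deliberately out of order, since the text notes that no ordering is required.

```python
def make_cell(page: int):
    """A passive cell: its PAGE register and a 1K page filled with a recognizable pattern."""
    return (page, [(page * 16 + w) & 0xFF for w in range(1024)])

def recall(cells, address: int) -> int:
    """Return the byte seen on L by the active CPU for a 16 bit address."""
    page, word = (address >> 10) & 0x3F, address & 0x3FF
    data = 0
    for page_register, memory in cells:
        if page_register == page:
            data |= memory[word]      # L is wire-OR'ed; at most one cell should match
    return data

cells = [make_cell(p) for p in (7, 2, 5)]
```

Here `recall(cells, (5 << 10) | 3)` returns the byte stored at word 3 of the cell holding page 5, wherever that cell sits in the subtree; an address whose page number matches no cell leaves L at zero in this sketch.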


4.3 Selection of Memory Width 


A collection of subtrees that constitute a data 
domain can be in a large subtree that constitutes an 
address domain. 


Figure 3. Width selection combining byte slices 


This can be accomplished by making the root cell of the 
larger subtree have STRUCTURE 10 making it a left slice 
CPU. It will also completely delimit the link com- 
munication above it. In this way, when a cell with 
STRUCTURE 10 broadcasts an address on the A link below 
it, it goes to all data domain subtrees simultaneously, 
and each recalls or memorizes a word of data separately 
at the same logical address. 


It should be noted that the leftmost data domain 
subtree is larger than the other two data domain sub- 
trees in Figure 3. All that we require is that each 
data domain has a unique page number for every page of 
data that contains data to be used with a memorize or 
recall cycle. The leftmost data domain will have an 
extra memory page which, at this point, need not be 
assigned. Indeed, because of the failure of some cells, 
it may well be that each data domain has a different 
number of nodes. If n is the minimum of the number of 
good nodes in any data domain, then pages in all data 
domains can be numbered from 0 to n-1. 


It should be evident that during a recall or 
memorize cycle, the size of memory can be any of a 
number of widths and heights. This selection is made 
by assigning the value of STRUCTURE in 
each cell. Finally, since the topmost cell in an ad- 
dress domain disconnects all links above it, it should 
be evident that a large tree can be space-shared, where 
different problems are run in different address domain 
subtrees. 


5. EXECUTE CYCLE OPERATION 


In the execute cycle, we assume that an operand is 
available in ACC and possibly one is available in TEMP. 
For logic operations such as negate or AND, the active 
CPU's will simply compute the result in parallel. (We 
will discuss in a later section how instructions appear 
in all active cells so that the cell CPU can execute the 
correct operation.) The only real problem is the com- 
munication on the carry link or the right shift link 
between active CPU's. We will consider the carry link. 
The right shift is similar. To see how this operation 
is done, we first look at the operation of the carry 
lookahead adder. In particular, we choose a standard 
carry lookahead module (74182). (We assume the reader 
is familiar with the operation of carry propagates, 
generates and carry inputs of this module.) 


Figure 4. Connections of a carry lookahead unit 


The carry-in, generate and propagate links of the 
CPU in that node are connected to the fourth (leftmost) 
set of links to this module, and the carry-in, generate 
and propagate links going rootward from each subtree of 
the node are connected to the carry lookahead module 
of that node. The set of links, group carry-in, group 
generate, and group propagate, are connected to the 
next rootward node carry lookahead module. The effect 
of this connection is to put all nodes in so-called 
left list matrix order (5) so that they can be thought 
of as being ordered in a chain, even though the delay 
is logarithmically related to the number of cells. 


Figure 5. Carry logic 


The manner in which the tree is made to look like 
a chain of CPU's is as follows (see Figure 5). LOOK- 
AHEAD 0 distributes carry signals to its attached CPU's 
and forms a group generate and group propagate shown 
above it. This is input to LOOKAHEAD 1 as though it 
were a CPU input in LOOKAHEAD 0. Consequently, CPU 4 
will see a carry generated by a CPU to its right if 
the intermediate propagates are all ones. This is 
equivalent to the operation of the ripple adder con- 
nected as in Figure 5b. Note also that if the group 
generate of LOOKAHEAD 1 is connected to its own carry 
input, and the unconnected inputs have Propagate = 1, 
Generate = 0, then the carry out of CPU 4 is the carry 
into CPU 0, which is equivalent to the end-around carry 
shown in Figure 5b. While end-around carry is not used 
as in one's complement arithmetic, this path enables the 
root cell of an address domain to set the carry input 
of the least significant CPU. 


To operate the execute cycle, then, we use the 
following scheme. The general scheme is to generate a 
carry input for the end-around carry in the topmost 
cell of a data domain, which serves as the leftmost CPU 
in the chain. The carry is passed from one active CPU 
(having STRUCTURE 01) to another, bypassing inactive 
(STRUCTURE 00) CPU's and subtrees (roots having 
STRUCTURE 10) that are disconnected from it. Bypassing 
is done by setting generate to 0 and propagate to 1. 
The rules for connecting the cells are shown in Figure 
6. Figures 6a and 6d show how root cells of data 
domains should behave. The end-around carry is im- 
plemented for the cell itself, and the links above it 
are connected so that the tree above it bypasses it. 
This enables the root cell to set the carry into the 
whole adder to zero for addition, or to one for sub- 
traction and so on. (Were this not possible, the cell 
in the tree corresponding to the rightmost cell of the 
carry chain of Figure 5b would also have to be specially 
designated by having a different value of STRUCTURE.) 


Figure 6. Connection of leaf and non-leaf cells 
for different values of STRUCTURE (given in 
parentheses) during the execute cycle. 


Secondly, a data domain root cell, with STRUCTURE 
01, will connect as in Figure 6b or 6e. Because it is 
a CPU in a byte slice of a larger chain, it contributes 
carry generate and propagate signals, and accepts carry- 
in signals in the normal way. 


Thirdly, an inactive cell, with STRUCTURE 00 will 
behave as in Figures 6c and 6f. 


It should be apparent that the carry chain is 
connected for addition, subtraction, and shift left 
(i.e., add a number to itself). The right shift re- 
quires a link that is easily connected in the reverse 
direction to the carry link. 
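
The carry mechanism of this section can be illustrated with a positive-logic sketch of generate and propagate signals. The function names are hypothetical, the 8-bit slice width follows the running example, and the active-low conventions of an actual lookahead module are ignored. The sketch shows that a bypassed cell (generate 0, propagate 1) is transparent to the carry, and that groups can be combined in any tree-shaped order.

```python
def slice_gp(a: int, b: int):
    """(generate, propagate) of one 8-bit slice adding a and b."""
    s = a + b
    return s > 0xFF, s == 0xFF

BYPASS = (False, True)   # generate = 0, propagate = 1

def combine(lo, hi):
    """Group (generate, propagate) of a less significant group lo followed by hi."""
    g_lo, p_lo = lo
    g_hi, p_hi = hi
    return g_hi or (p_hi and g_lo), p_hi and p_lo

def carries_in(gp_chain, carry_in: bool):
    """Carry into each position of a chain listed least significant first."""
    result, c = [], carry_in
    for g, p in gp_chain:
        result.append(c)
        c = g or (p and c)
    return result

# Adding 0x01FF and 0x0001 with a passive cell between the two active slices.
low, high = slice_gp(0xFF, 0x01), slice_gp(0x01, 0x00)
chain = [low, BYPASS, high]
```

In this example the carry generated by the low slice reaches the high slice unchanged through the bypassed cell, exactly as it would in a two-slice chain without the passive cell.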


6. FETCH CYCLE 


The fetch cycle is responsible for getting the 
instruction to all active CPU's. We will show how an 
eight bit instruction can be sent to all of them. It 
is accomplished by the following scheme. We will assume 
that conditional branching will be done only on the 
sign bit for simplicity. Since the sign bit is in a 
left end CPU, that CPU is the only one that can evaluate 
a conditional branch. So it alone will send out the 
address of the instruction in a fetch cycle, as it did 
for the recall/memorize cycles. The address is derived 
from PC and PC is incremented if no branch is taken, 
or from X if a branch is taken, as in standard computer 
organizations. However, the data domain is made the 
same as the address domain so that the instruction is 
sent to all cells. This is done by delimiting the 
data link L only above a cell with STRUCTURE 10. All 
cells will load the instruction into their I register. 


In order to support this scheme, it is necessary 
that the words addressed as instructions are in only 
one memory page in the entire address domain. This is 
accomplished by storing instructions in pages that are 
not used for data. For example, in the tree in Figure 
3, each byte slice subtree may have a page address from 
0 to 11 for data. Each byte slice subtree then has one 
page left over, and the leftmost byte slice has two 
pages left over. Thus, four pages are available to 
store instructions (assuming no faulty pages are in the 
tree). These pages will be given page addresses 12 to 
15. All instructions will then have to be in pages 12 
to 15 so that just one 8 bit word will be read during 
a fetch cycle. 


It should be evident that at the end of a fetch 
cycle, all active CPU's have the same instruction. 
Thus, they can decode the instruction in parallel. The 
decoded signals will control the CPU's. Only the left 
slice CPU will control the address and memory, however. 
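
A minimal sketch of the fetch cycle described in this section follows. It is a software model only: the `Cell` class, the `fetch` function and the dictionary used as instruction memory are hypothetical, inactive cells (STRUCTURE 00) are assumed not to load the instruction, and PC is assumed to continue from the branch target after a taken branch.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    structure: int   # 2 bit STRUCTURE value; 0b00 marks an inactive cell
    i_reg: int = 0   # instruction register I


def fetch(cells, memory, pc, x, branch_taken):
    """Model one fetch cycle and return the updated PC.

    Only the left end CPU evaluates the branch, so the address is chosen
    once (from X on a taken branch, otherwise from PC) and the addressed
    8 bit word is then sent to every active cell.
    """
    address = x if branch_taken else pc
    instruction = memory[address] & 0xFF
    for cell in cells:
        if cell.structure != 0b00:      # assumption: inactive cells ignore it
            cell.i_reg = instruction
    return address + 1                  # assumption: PC follows the fetched address
```

After a call such as `fetch(cells, memory, 12, 40, False)`, every active cell holds the same instruction and can decode it in parallel.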


7. SET-UP OF THE STRUCTURE 


The normal operation of the machine is defined in 
terms of address, control and data domains, as we have 
shown in the earlier sections. These are established 
by setting STRUCTURE and PAGE in each cell. This can 
be done to avoid faulty cells. We consider how this is 
done now. The problem of identifying faulty cells and 
excising them is first discussed as part of the set-up 
procedure. Then the problem of arranging the structure 
is considered. Finally we consider the input/output 
strategy, which is also used to request a set-up. 


7.1 Identification of Faulty Cells 

As we noted earlier, the bidirectional amplifier (4) 
can be used to isolate faulty cells, especially those 
in which an output amplifier is stuck-on-one. The 
bidirectional amplifiers in the links can be arranged 
so that they transmit information only from the rootward 
cell to the leafward direction. Hence, a stuck-on-one 
fault in a tree structure cell tends to be in a leaf 
cell, so that most of the machine is able to receive 
information from the root of the entire tree. A fault 
diagnosis program can be entered at the root to exercise 
all the memory chips independently, as memory is 
commonly exercised in a diagnostic program in any 
computer. The CPU can also be exercised, either with 
instructions from the root of the entire tree or from 
routines stored in the memory pages when they have been 
checked out. A faulty memory can be given a PAGE number 
that is not used by the programs to be executed, so 
that it is not used. A faulty CPU can excise itself 
by setting STRUCTURE to 00. A subtree containing a 
faulty link (stuck-on-one) can be excised by setting 
STRUCTURE to 10 in the root of the subtree. The test 
for a faulty link is simple. If a cell and each of its 
sons in the tree structure are found to be faulty, the 
cell will have its STRUCTURE set to 10. 


This scheme simplifies the fault detection program 
to the problem of checking just one cell. All cells are 
then checked in parallel in this machine. 


7.2 Setting of STRUCTURE and PAGE 

Several techniques are possible for setting these 
values in each cell. A simple technique is serial 
transmission of the address and data on two broadcast 
links, U and V, and a propagating (store and forward) 
link T (see Figure 7). We discuss a binary tree here 
for simplicity. This scheme is similar to Berkling's 
address scheme (6). U contains (sequentially) the 
address and values of STRUCTURE and PAGE. V is 1 if 
the corresponding bit is an address, zero otherwise. 
At the beginning, V is 1, T is 1 into the root cell. 
The bit on U carries T to become 1 on the link to the 
left son of the root if U is 0, and the right son of the 
root if U is 1. The sequence of address bits can be 
sent on U as V is 1, while T walks down the tree. For 
example, to address cell A in Figure 7, U would be the 
sequence 1, followed by 0. When the desired cell has 
been reached, V is set to zero. The register pair 
STRUCTURE, PAGE then would operate as a shift register, 
shifting the value of U into it. The cell would be 
rendered inoperative until the eight bits have been 
shifted in. This process would be repeated, first 
setting T to 1 and V to 1 at the input to the root for 
supplying addresses and values, to all cells. 


Figure 7. Selection of Cells by Tree Address 


The above procedure can be executed to set up a 
structure. It can even be set up in one subtree while 
other subtrees are doing useful work. It only requires 
that a way is available to move data into the pages set 
up, and conversely, for a processing tree to request the 
root cell to change the structure. This is part of 
the general input/output problem, which we consider 
next. 
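
The selection scheme of this section can be sketched in software. The sketch rests on assumptions: the tree is modelled with a hypothetical `Cell` class, the token T is represented by the variable `selected`, and the value bits are taken from U once V has dropped to zero, as the description of U suggests. The example tree places cell A at the address 1, 0 given in the text for Figure 7.

```python
class Cell:
    def __init__(self, name, left=None, right=None):
        self.name = name
        self.left = left
        self.right = right
        self.bits = []   # stands in for the STRUCTURE, PAGE shift register


def send(root, u_bits, v_bits):
    """Drive the serial streams U and V into the root; return the selected cell."""
    selected = root                  # T = 1 is applied at the root
    for u, v in zip(u_bits, v_bits):
        if v == 1:
            # Address bit: T moves to the left son if U is 0, right son if U is 1.
            selected = selected.right if u == 1 else selected.left
        else:
            # Value bit: shifted into the register pair of the selected cell.
            selected.bits.append(u)
    return selected


a = Cell("A")
root = Cell("root", left=Cell("L"), right=Cell("R", left=a))
values = [1, 0, 1, 1, 0, 0, 1, 0]    # eight illustrative value bits
target = send(root, [1, 0] + values, [1, 1] + [0] * 8)
```

Here `target` is cell A, and its shift register holds the eight value bits. Repeating the call with other address prefixes would set up the remaining cells one at a time.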


7.3 Input and Output 

One feasible scheme for input and output would be 
to build a tree of bidirectional links (I/O link) in 
parallel to the tree used for processing. The input 
scheme would be to select a cell, as we did in the last 
section, and broadcast the data on the I/O link. Only 
the selected cell will respond. For output, the entire 
tree can have a hardware priority structure such that 
a cell can request a path from the root to select the 
cell. Then it can broadcast into the I/O link tree. 
Only the root of the tree will respond. Data can be 
output, and requests to change the structure can be 
sent to the root cell in this manner. However, a large 
number of possibilities exist, with different advantages 
in terms of cost or throughput. It is necessary, however, 
that faulty cells can be deleted from this tree 
as well as the tree used for processing to avoid 
stuck-on-one faults in the link. The same procedure that we 
discussed for checking for stuck-on-one faults in the 
processing tree can be used in this particular I/O link 
tree. 


8. CONCLUSIONS 


We have described a novel computer architecture 
that offers most of the advantages of cellular machines, 
yet is sufficiently similar to a standard von Neumann 
computer once it is set up that it will be possible to 
program it. The techniques described in this paper 
show it is possible to select the height and width of 
memory to be used in a task. It is also possible to 
avoid faulty cells in this structure. 


ABSTRACT 


Because of dramatic reductions in cost of mini-compu- 
ters, peripherals and logic modules, it is becoming 
evident that many problems confronting the computer 
system designer will be solved in the future by hybrid 
designs involving not only software but also speciali- 
zed computers with architectures best suited to each 
application. Accordingly, hardware research must no 
longer be considered as a separate discipline by sys- 
tem programmers but as a tool in exactly the same way 
as languages. To illustrate this philosophy, a hard- 
ware laboratory has been set up at the University of 
Montreal. The primary interest of the founders was in 
designing and building small specialized computing sys- 
tems. 

This paper describes some of the aspects of the labo- 
ratory with emphasis on two major developments: 

(1) The design of a programmable I/O switch between 
two mini-computers. 

(2) The addition and monitoring of a writable control 
store connected to one system. 


1. INTRODUCTION 


Because of dramatic reductions in cost of mini-computers, 
peripherals and logic modules, it is becoming 
evident that many problems confronting the computer 
systems designer will be solved in the future by hybrid 
designs involving not only software but also specia- 
lized computers with architectures best suited to each 
application. It follows that hardware research is im- 
portant to any Computer Science Department as a whole. 
However, research in computer architecture is not pos- 
sible on existing Computer Centre machines. These are 
committed to providing a service to the software com- 
munity and cannot allow internal modifications or the 
attachment of non-standard peripherals. Further, be- 
cause of the time required to produce and market com- 
puters, these machines are obsolete and their design 
is 5 to 10 years behind that of the prototypes being 
developed by the manufacturers. By working only with 
available machines, a university researcher is, in a 
sense, developing algorithms for yesterday's computers. 

To fill this hardware gap, at the end of 1971, a hardware 
laboratory was set up by the authors and Prof. 
Paul Bratley at the Computer Science Department of the 
University of Montreal. The primary interest of the 
founders was in designing and building small specialized 
computing systems but the aims of the laboratory 
went beyond this in that the laboratory should provide 
hardware facilities to aid research in all areas of 
the department in roughly the same way as a computer 
center provides software facilities. In short, we 
wanted to extend the range of options available to our 
researchers to the fields of hardware and firmware [14]. 
To determine the equipment needed in the laboratory, 
it was necessary to consider the probable uses that 
would be made of this equipment. Here is a list of 
research areas that were thought to be likely users of 
the laboratory facilities (asterisks indicate that pro- 
jects are currently under way in a given area): 
(1) Microprogramming [2, 5, 12, 15] 
    - Testing new instruction sets * 
    - Emulation of high level machines * 
(2) Storage management [1, 3] 
    - Virtual memory and segmentation * 
    - Hardwired garbage collection 
(3) Real-time operation [6, 7] 
    - Process control 
    - Time sharing supervisors * 
    - On-line processing of sounds and images * 
    - Telecommunication networks 
(4) Performance measurement [8] 
    - Hardware monitors * 
(5) Trial designs with new technology 
    - LSI, COSMOS, MOSFET * 
(6) Specialized peripherals 
    - Adaptive learning networks [9] 
    - Stochastic computer * [4] 
    - Associative memory for network algorithms [11] 
    - Sort-merge processor * 


To provide hardware support for the projects listed 
above, the laboratory should have a very flexible 
computer "test-bench" in addition to the traditional 
electronic test instruments. There are four major 
requirements for this "test-bench": 

- it should support microprogramming; 

- connection to peripherals and other computers 
should be simple; 

- it should have software writing tools from the 
start (compilers, assemblers and secondary 
storage); 

- the cost should be as low as possible. 

In the rest of this paper, we describe the computing 
hardware purchased for the laboratory and consider in 
detail two important hardware modifications that were 
necessary to give the system the required flexibility: 
the addition of a writable memory for microprograms 
and a modification to the I/O structure to allow sharing 
of peripherals between two computers. 


2. LABORATORY COMPUTING EQUIPMENT 


The "test-bench" is based on two nearly identical 
INTERDATA 4 mini-computers. Experience in the labo- 
ratory has shown that it is essential to have two com- 
puters so that software development can proceed on one 
while the other is laid up for hardware modifications. 
The INTERDATA 4 was chosen because it has several fea- 
tures useful for our applications [16]: 


2.1. Microprogram control: The INTERDATA 4 runs under 
the control of a microprogram in a read-only memory. 
The machine structure permits the replacement of this 
memory by a writable control store, allowing dynamic 
microprogramming. 


2.2. 16 General purpose registers: When emulating 
virtual machines, it is possible to reserve a few re- 
gisters for special use and still leave enough for ge- 
neral purpose use by the programmer. 


2.3. Instruction set: The instruction set is closely 
related to the IBM/360 and the INTERDATA is programmed 
like bigger machines. Out of 256 instructions possible 
with the 8 bit OP-code, only 86 are implemented by 
INTERDATA. There is therefore room for new micropro- 
grammed instructions. 


2.4. Simple I/O BUS [17]: The standard multiplexer 
I/O BUS is simple and easy to connect to. It has only 
27 lines: 8 for input, 8 for output and 11 for con- 
trol, test and initialisation. This compares favour- 
ably with another machine which was also considered 
for the laboratory: the DEC PDP/11. The DEC UNIBUS 
has 56 lines which combined with 64 grounds make a 120 
wire BUS. The INTERDATA BUS uses the "handshaking" 
principle to reduce the effect of transmission delays. 
Eight bits are used to address peripherals so that 255 
units could be connected to the BUS. A wide variety 
of peripheral devices are connected to the two compu- 
ters. These include standard units such as: a high 
speed paper tape reader/punch, a matrix printer (165 
char/sec), 2 disc drives with a capacity of 1.5x10 
bytes each and 2 selector channels to control high 
speed data transfers. 

Other peripherals are available for real time and 
time-sharing operation: telephone line controller, 
A/D and D/A converters, memory protect which is con- 
trolled as an I/O device and a high precision program- 
mable clock. 


3. A NEW CONTROL STRUCTURE FOR DYNAMIC MICROPROGRAMMING 


The standard INTERDATA has its control microprogram 
stored in a magnetic read-only memory, MMF, wired at 
the factory. To allow dynamic microprogramming expe- 
riments, a writable microprogram memory, MMA, was ad- 
ded to the system. This section first describes how 
instruction fetch and execute is carried out by the 
INTERDATA; this will show the extent to which machine 
behaviour can be modified by microprogramming. Then 
the strategy presently employed to make simultaneous 
use of both MMF and the new MMA will be given [18,19]. 


3.1. Instruction processing: The INTERDATA is not a 
"pure" microprogrammed computer. The microinstructions 
are short (16 bits) and highly coded. Each microinstruction 
has a 4 bit op-code which determines, along 
with internal status registers, the meaning of the 
other bits in the instruction. A microprogram location 
counter determines the next microinstruction to be executed. 
The normal FETCH-DECODE-EXECUTE cycle of the 
INTERDATA is shown in Figure 1. Some frequently used 
functions of this cycle, steps 2, 4 and 6, have been 
speeded up through the use of special hardware. These 
functions which are used by the DECODE microinstruction 
are mainly concerned with the interpretation of the op-code 
part of user "macroinstructions". Naturally, they 
make certain implicit assumptions about instruction 
format and the structure of the op-code. For example, 
in step 2, the DECODE microinstruction determines the 
instruction type and causes a branch to the appropriate 
microroutine to fetch the operands. Later, in 
step 4, use is made of a special high-speed decoding 
read-only memory or DROM. When provided with an op-code, 
the DROM returns the address of the corresponding 
EXECUTE microroutine. Any op-code which is not implemented 
in the standard INTERDATA results in a branch 
to an "Illegal Instruction" microroutine which in turn 
causes an interruption in the user programme. 


Figure 1: Instruction execution and formats 
  1. Next instruction fetch (*) 
  2. Decoding of instruction type: RR, RS, RX or RI (**) 
  3. Preprocessing: fetching of operands (*) 
  4. DROM interrogation for microroutine address (**) 
  5. Processing of operands (*) 
  6. Test of machine status bits (interruptions, failure, etc.) 
  *  Under microprogram control 
  ** Hardware operations (microinstruction "Decode") 
  (The flowchart also shows the update LOC = LOC+1 and the 
  RR, RS, RX and RI instruction formats.) 

The microprogrammer is therefore faced with a trade- 
off between speed and flexibility. If a standard ins- 
truction is replaced by a new one with the same format 
but different meaning, he can use the DECODE hardware 
to get fast execution. On the other hand, he can mi- 
croprogram without using DECODE and obtain complete 
freedom in the interpretation of macroinstructions. In 
this case, the DECODE phase of the instruction proces- 
sing will be lengthened. However, if the new instruc- 
tions are fairly sophisticated and have long EXECUTE 
times, the penalty incurred in bypassing the DECODE 
hardware will be minimized. In the near future, we 
plan to replace the DROM by a writable decoding memory 
to regain some speed while retaining full flexibility. 


3.2. The writable control memory (MMA): The writable 
memory is a thin film memory with non-destructive read-out 
obtained from Memory Systems Inc. of California. It 
has 1024 16-bit words with 400 nsec access time. It 
is slightly slower and smaller than MMF which contains 
2K words with 300 nsec access. The speed difference 
does not affect proper operation of the machine and 
the size difference is not critical since the standard 
microprograms occupy only 35% of MMF. The MMA is connected 
to the INTERDATA in two different ways: (a) 
It is connected through a controller to the Multiplexer 
I/O BUS and can be used as a fast external memory 
accessed by the standard I/O commands - this is the 
way in which user microprograms are introduced into 
MMA; (b) It is also connected to the internal registers 
of the machine and works in parallel with MMF. 
When the INTERDATA requires a new microinstruction, 
both MMA and MMF respond and try to place their output 
into the internal data register. Which memory 
word is selected is decided by an electronic switch 
which is controlled by MMA's I/O controller. Initially, 
MMF is connected but control can be passed to MMA 
by a special output command to MMA. 

This system is very flexible. The INTERDATA can run 
under exclusive control of either memory or control 
can be passed from one to the other during execution. 


3.3. Changing the instruction set: Here two possibilities 
must be considered: (a) The replacement of all 
the old instructions by a completely new instruction 
set and (b) The addition of a few new instructions to 
the already existing set. 

In the first case, the complete new microprogram is 
loaded into MMA through the I/O BUS under control of 
MMF. Control is then passed to MMA by an output command. 

The second case is more complex since, ideally, the 
existing firmware in MMF should be used for the old 
instructions and control should be given to MMA only 
for the new instructions. In this way the limited 
space of MMA is not wasted by duplicating the firmware 
already in MMF. This mode of operation gives rise to 
two problems: (a) deciding whether an instruction is 
"old", "new" or "undefined" and (b) finding the address 
of the microroutine in MMA corresponding to a "new" 
instruction. In MMF, this second problem is resolved 
through the use of the DROM. In our system, the first 
problem is partially solved through the use of the 
"illegal instruction" microroutine already in MMF. When 
an undefined instruction is encountered, the standard 
firmware causes an interrupt and stops executing the 
current program to branch to a monitor routine located 
at a predetermined address in core. This process is 
accomplished efficiently through the exchange of Program 
Status Words (PSW's). The action of the monitor 
routine depends on the operating system; in our BOSS 
supervisor, it prints an error message and aborts the 
user job. "New" instructions are treated by MMF as illegal 
instructions and it is a simple matter to modify 
the monitor routine so that it transfers control to 
MMA before printing the error message. To decide whether 
the instruction is "new" or illegal, the MMA 
microprogram uses a table in core with an entry for 
each possible op-code. In the case of a "new" instruction, 
the entry contains the address of the required 
microroutine; for an undefined instruction, it contains 
a special indicator. In the first case, the proper 
microroutine is executed and the program location counter 
is set to point to the next instruction in the user 
program. Then MMA issues an output command to return 
control to MMF. In the second case, control is returned 
to MMF immediately without altering the location 
counter so that the monitor routine can resume execution. 
This simultaneous use of MMA and MMF is summarized 
in Figure 2. 
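
The table lookup performed by the MMA microprogram can be sketched as follows. This is a software analogy, not the microcode itself: the function name `mma_dispatch`, the use of `None` as the "special indicator", the callable table entries and the `length` parameter (the size of the trapped instruction) are all assumptions made for illustration.

```python
UNDEFINED = None   # hypothetical stand-in for the "special indicator"


def mma_dispatch(opcode, table, loc, length):
    """Model the MMA microprogram's reaction to an op-code trapped by MMF.

    Returns (new_loc, executed). In both cases control would then be
    handed back to MMF by an output command.
    """
    routine = table[opcode]
    if routine is UNDEFINED:
        # Undefined op-code: leave the location counter alone so the
        # monitor routine can carry on and report the error.
        return loc, False
    routine()                     # microroutine for the "new" instruction
    return loc + length, True     # point at the next user instruction


log = []
table = [UNDEFINED] * 256         # one entry per possible 8 bit op-code
table[0xF0] = lambda: log.append("new instruction 0xF0")
```

With this table, `mma_dispatch(0xF0, table, 100, 4)` runs the new microroutine and advances the location counter, while any other op-code leaves it unchanged.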


Figure 2: Simultaneous use of MMA and MMF 


3.4. Conclusions: As indicated by Rosin et al. [13], 
the choice of an inexpensive host machine for dynamic 
microprogramming is severely limited. We have shown 
that with a few relatively inexpensive modifications, 
the INTERDATA 4 can be made suitable for microprogramming 
experiments. The resulting system does not have 
the full flexibility of a machine designed specifically 
for the purpose and the main deficiencies are: 

(a) The amount of parallel internal operation is limi- 
ted due to the highly coded format of the microinstruc- 
tions; 

(b) There is no support for subroutine linkage at the 
microprogram level; 

(c) Addressing is restricted by the division of the 
micromemory into 256 word blocks; 

(d) The instruction decoding mechanism (DROM) is not 
alterable dynamically. This, however, will be modified 
shortly. 

In spite of these deficiencies, the present system is 
adequate for most of our needs: virtual memory manage- 
ment is being implemented through microprogramming, a 
new instruction set for LISP programs is in the design 
stage and fast instructions for waveform synthesis have 
been microprogrammed. 


4. INPUT/OUTPUT MODIFICATIONS 


Although the laboratory has two mini-computers, each 
with its own teletype, the other I/O units have not 
been duplicated: for example, there is only one high 
speed paper tape reader-punch and only one printer. 

The present I/O capacity is sufficient for both machines 
but it is often necessary to transfer units from one 
computer to the other. To simplify this procedure, a 
programmable I/O switch (PIOS) has been designed allow- 
ing the units to be shared between the two systems. 
This PIOS is relatively simple and its low cost (under 
$500) is compatible with the rest of the system. 


4.1. INTERDATA I/O channels [16, 17]: As shown in 
Figure 3, the INTERDATA does I/O in either of two 
ways: 
(a) Directly via the multiplexer Bus (MB), misleadingly 
called channel by INTERDATA. The INTERDATA instruction 
set includes several instructions to transfer information 
along this BUS. Although up to 64 K bytes 
can be transferred with one instruction, it is important 
to note that the CPU is fully occupied by the 
transfer and cannot execute any other instructions in 
parallel. 

(b) Through an optional Selector channel. This device 
is controlled via MB. It has a direct access to the 
memory and can transfer data between memory and units 
on the selector BUS (SB). After a transfer has been 
initiated, the channel works in parallel with the CPU. 
However, the parallelism is limited in that the channel 
does not fetch Channel Command Words from memory 
and each transfer must be initiated by the CPU. 

The INTERDATA also has an interrupt mechanism similar 
to the IBM/360's which can be used to eliminate "busy 
waiting" on the part of the CPU. 


Figure 3: Systems Interface - Block Diagram 
(Blocks shown: core memory, high speed memory bus (half 
word), selector channel, multiplexor bus (byte), selector 
bus (byte), device controllers and devices.) 


4.2. The programmable I/O switch (PIOS): In our system, 
shown in Figure 4, each bus has been divided into 
three sections. Each computer has exclusive use of 
one section and the third can be connected to either 
system under control of PIOS. It is on the third segments, 
SB and MB, that the shared units are connected. 
SB and MB are controlled separately. Manual switches 
on PIOS can force connection to either system or leave 
the connection to program control. It should be noted 
that switching applies to the buses and not to individual 
I/O units. 


Figure 4: PIOS - General Structure 


Requests for the use of the shared units are sent to 
PIOS via OUTPUT COMMAND and WRITE instructions along 
MB1 or MB2. The result of a request is indicated by 
the status bits of PIOS which can be tested from either 
system through SENSE STATUS instructions. Correct 
sharing of the common units requires cooperation 
between the two systems. To simplify the process, a 
hardware "reservation" procedure has been built into 
PIOS. This hardware does not make the system "idiot-proof" 
and ways in which the system can be misused will 
be shown later. The reservation mechanism, however, 
makes it quite easy for the supervisor routines handling 
I/O to avoid deadlock and excessive lock-out. 
This is in keeping with the general philosophy of the 
laboratory to integrate hardware and software design. 
Hardware verification of possible errors would more 
than double the PIOS hardware with no increase in overall 
efficiency. To implement the reservation mechanism 
PIOS maintains two registers, one for each system, 
where each bit corresponds to a unit on the common 
buses. These registers are consulted whenever a request 
is made to PIOS. PIOS accepts two types of request 
which operate as follows: 
(a) Reservation request - Here a system provides one 
word of information indicating the unit or units it 
wishes to reserve. The word has the same format as 
the internal reservation registers. This word is compared 
with the reservations made by the other system 
and if conflicts arise, the request is rejected; otherwise, 
the request is accepted and the first system's 
reservation register is updated; 
(b) Connection request - Although connection is made 
to the bus and not to individual units, this request 
must indicate the unit it wants to access. This is 
done in the same way as a reservation by providing one 
word with the appropriate bit "ON". Only if the unit 
has previously been reserved and the bus is free, is 
the connection made. 

In this system, a bus, once connected to one of the 
computers, remains connected until the computer releases 
it. There is also no check made by PIOS upon 
the use the computer makes of the bus. To cancel reservations 
or to disconnect a bus, a request is sent 
with a null information word. If the common units are 
to be shared properly, a certain protocol must be observed 
by both systems. This protocol is best explained 
with an example. Assume that computer A wishes to 
use the printer connected to the shared multiplexer 
bus. At the same time, computer B is using the paper 
tape reader-punch on the same bus. The necessary operations 
for computer A are shown below: 
  1.   Start of job; 
       Reserve printer; 
       ... 
       For each output to the printer the following 
       3 step sequence should be observed: 
  2.   Connection to MB by requesting printer; 
  3.   Output to printer; 
  4.   Disconnect MB; 
       ... 
  N-1. Cancel printer reservation; 
  N.   End of job; 

By reserving the printer for the duration of the job 
we ensure that computer B's output will not be mixed with 
computer A's. By disconnecting the bus (step 4) after 
each output, we allow computer B to access the punch. 
The process works well only as long as the two systems 
cooperate. If one system neglects to disconnect the 
bus, the other will be locked out. Also, since PIOS 
does not check the activity on the bus, it is possible 
to bypass the reservation system by requesting connec- 
tion to one unit and doing I/O to another. In step 3 
of the example, computer A could have used the punch 
reserved by computer B. However, the required cooper- 
ation can easily be enforced by appropriate supervisor 
software in both systems. 
The problem of deadlock can also be avoided by soft- 
ware. Deadlock commonly occurs when: system 1 has X 
and wants Y, and, system 2 has Y and wants X. The so- 
lution makes use of the reservation procedure and pro- 
ceeds in two steps: (1) cancel all reservations, and, 
(2) reserve all units required with a single request. 
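
The reservation and connection requests, and the two-step remedy for deadlock, can be modelled with a short sketch. It is a toy model under stated assumptions: a single shared bus, systems numbered 0 and 1, one bit per unit, and an accepted reservation that is OR-ed into the requesting system's register (the text only says the register is "updated"). The class and method names are hypothetical.

```python
class PIOS:
    """Toy model of the switch for one shared bus and two systems (0 and 1)."""

    def __init__(self):
        self.reserved = [0, 0]   # one reservation register per system
        self.owner = None        # system currently connected to the bus

    def reserve(self, system, word):
        """Reservation request; a null word cancels all of the system's reservations."""
        if word == 0:
            self.reserved[system] = 0
            return True
        if word & self.reserved[1 - system]:
            return False                    # conflicts with the other system
        self.reserved[system] |= word       # assumption: new bits are added
        return True

    def connect(self, system, word):
        """Connection request; a null word disconnects the bus."""
        if word == 0:
            if self.owner == system:
                self.owner = None
            return True
        unit_reserved = (word & self.reserved[system]) == word
        if self.owner is not None or not unit_reserved:
            return False                    # bus busy or unit not reserved
        self.owner = system
        return True


PRINTER, PUNCH = 0b01, 0b10                 # illustrative unit bits


def reserve_all(pios, system, units):
    """Deadlock-free acquisition: cancel everything, then ask once for all units."""
    pios.reserve(system, 0)
    return pios.reserve(system, units)
```

Because `reserve_all` never holds one unit while waiting for another, the "system 1 has X and wants Y" situation cannot arise; a rejected request simply leaves the caller with no reservations, and it can retry later.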


4.3. Conclusions: A programmable I/O switch has been 
designed and built to allow two mini-computers to share 
I/O units. The I/O switch makes use of a novel hard- 
ware "reservation" scheme to facilitate cooperation 
between the two computers. The complexity (120 integrated 
circuits) and cost (about $400) of the I/O switch 
are low, in keeping with the rest of the system. The 
"reservation" hardware is not meant to prevent I/O 
programming errors; but it does make it easy to write 
supervisor software to ensure correct operation. The 
I/O switch is modular in design and it is planned to 
add more features to it: in particular, a channel to 
channel communication facility to enable the two com- 
puters to exchange data efficiently. Research is also 
being done to find a foolproof scheme to handle inter- 
ruptions from the shared units and direct them to the 
proper system. 


5. SUMMARY 

In spite of its short existence, the hardware labora- 
tory has proved to be a very useful research facility. 
The flexible computing "test-bench" described in this 
paper is already being employed by several projects. 
One project in the field of graphics is of special in- 
terest since it uses most of the system's facilities: 
it involves the design of a hardware vector generator 
driven by software and firmware routines. Two major 
software projects are under way: WADOCH, a real-time 
operating system and a compiler for the PASCAL lan- 
guage. To run efficiently, these programs require more 
memory than presently available on our machines. This 
should be solved by the next major improvement to the 
computing system: the addition of virtual memory. At 
first, the address modification will be done entirely 
by microprogram but later associative registers will 
be added to speed up the process. 

The current projects are not only from the systems 
section of the department. Two numerical analysts are 
studying the possibility of microprogramming special 
instructions for interval arithmetic [10] and a group 
in artificial intelligence is building an adaptive 
learning network to be controlled by the INTERDATA. 
The laboratory has also been invaluable for a course 
in Computer Architecture given to undergraduate stu- 
dents. 

By basing our computing test-bench on standard mini- 
computers and peripherals, the investment required to 
set up the laboratory was modest: $110,000 divided 
into $85,000 for the computing hardware and $25,000 
for electronic equipment and tools. The flexibility 
required for research was achieved through the two 
modifications described in this paper. The main 
characteristic of the laboratory is that it permits integrated 
research into three related areas: software, hardware 
and firmware. This approach has already proved 
fruitful and should remain successful in the future. 


ABSTRACT* 


In a case studies approach to computer architecture education, 
there is a need for small-scale simulation exercises to illustrate 
significant concepts and to provide hands-on student experi- 
ence with architectural tradeoffs. Two such exercises are 
discussed, and one is described in some detail. The exercises 
cover virtual memory and multiprogramming systems' architecture, 
and are suitable as projects students can do within a ten-week 
academic quarter. Some hindsight based on student 
reaction to these exercises is provided, together with estimated 
costs to the educator and students for exercise development and 
execution. 


I. INTRODUCTION 


A case studies approach to computer architecture education has 
a number of advantages, one of which is that there exists today 
an excellent text in support of such an approach.(1) However, 
a disadvantage of the approach is the difficulty of going 
beyond computer system overviews and comparisons to ensure a 
solid grasp of significant concepts and to develop a student's 
feel for actually doing computer architecture. One way to 
accomplish these ends is to supplement the student's diet of case 
studies with a special interest architectural topic to be explored 
in some depth during the term. Examples of such topics are 
virtual memories, multiprogramming, parallel processing, 
higher level language architectures, etc.; and computer simu- 
lation can be a powerful tool for in-depth exploration of these 
topics if simulation exercises can be defined which are con- 
sistent with the various constraints of an academic environment. 


Ideally, simulation exercises for these tutorial purposes should 
be comprehensive, realistic, computationally feasible and 
student compatible. 


A well-designed simulation exercise can be expected to improve 
a student's understanding of important architectural concepts 
and his feel for the impact of architectural tradeoffs in varying 
computer system environments. It is interesting that even the 
(inevitable) imperfections of the models used in the simulation 
exercises can serve as tutorial assets, because an imperfect 
model can serve as a "straw man" and provoke student thought 
as to what form a better model should take. 


* The work reported on here was done while the author was visiting
Wright State University, Dayton, Ohio, 9/72-6/73.


This paper is based on the author's experience with two small-scale
computer simulation exercises which have been used in
support of case studies type computer architecture courses taught
by the author at Wright State University. The exercises cover
virtual memory and multiprogramming systems' architecture,
and each exercise was developed and executed as a supplementary
project during a ten-week academic quarter. Because it
is assumed that the primary audience for this paper is computer
architecture educators, the emphasis of the paper is on objectives
and strategy for both exercises, and only the virtual memory
exercise is described here. Additional details on both exercises
are available in (2) for those interested.


II. GENERAL OBJECTIVES, STRATEGY AND COSTS


Since the general objectives and strategy were the same and 
costs were similar for both exercises, it is most efficient to cover 
these matters in a separate section. Specific objectives and 
strategy will be covered later in sections dealing with the indi- 
vidual exercise particulars. 


General Objectives


Some of the general objectives of both exercises can be com- 
pactly described with the aid of formulas, as follows: Let 


$Y_i$ = value of the ith system performance
        figure of merit                                      (1)

$X_i$ = value of the performance or capacity
        of the ith system component                          (2)

$W$ = system workload or job environment                     (3)

$P$ = system price                                           (4)

$Y_i = F_i(X_1, X_2, \ldots, X_n, W)$                        (5)

$Y_i/P$ = system performance/price with respect
          to the ith system figure of merit                  (6)

$\partial Y_i/\partial X_j$ = sensitivity of the ith system performance
          index to changes in performance of its
          jth component                                      (7)

$\partial Y_i/\partial W$ = sensitivity of the ith system performance
          index to changes in system workload                (8)

$\partial Y_i/\partial P$ = sensitivity of the ith system performance
          index to changes in system price                   (9)


Then the general objectives of both exercises can be succinctly 
stated as follows: 


1. To improve a student's understanding of the architectural
concept being explored by having him construct a model
of a system which incorporates this concept; i.e., by
having him write a program to implement the $F_i$.


2. To improve a student's understanding of the behavior of a
system based on this concept by having him run his model
with varying component values and workloads, and having
him observe associated variations of performance indexes,
i.e., by having him determine various values of
$\partial Y_i/\partial X_j$ and $\partial Y_i/\partial W$.


3. To improve a student's understanding of system performance/
price and variations thereof by having him include price
data explicitly in the above calculation; i.e., by having
him determine values of $Y_i/P$ and $\partial Y_i/\partial P$.
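

The sensitivities of equations (7) and (8) can be estimated from any program that implements $F_i$ by rerunning it with one input slightly perturbed. The Python sketch below is illustrative only: `toy_model` is an invented stand-in for a student's simulator (it is not a model from the exercises), and central differences are one assumed way of approximating the partial derivatives. For a categorical input such as a page replacement algorithm no derivative exists, and the comparison is simply between runs.

```python
def toy_model(x, w):
    """Hypothetical stand-in for F_i: returns a figure of merit Y_i.

    x is a list of component values [X_1, ..., X_n]; w is a workload number.
    The formula is invented purely for illustration.
    """
    return 10.0 * x[0] / (1.0 + w) + 2.0 * x[1]


def component_sensitivity(model, x, w, j, step=1e-3):
    """Central-difference estimate of dY_i/dX_j (equation 7)."""
    up = list(x)
    down = list(x)
    up[j] += step
    down[j] -= step
    return (model(up, w) - model(down, w)) / (2.0 * step)


def workload_sensitivity(model, x, w, step=1e-3):
    """Central-difference estimate of dY_i/dW (equation 8)."""
    return (model(x, w + step) - model(x, w - step)) / (2.0 * step)


def performance_per_price(model, x, w, price):
    """Performance/price Y_i / P (equation 6)."""
    return model(x, w) / price


if __name__ == "__main__":
    x, w = [4.0, 3.0], 1.0
    print(component_sensitivity(toy_model, x, w, j=0))
    print(workload_sensitivity(toy_model, x, w))
    print(performance_per_price(toy_model, x, w, price=2500.0))
```

Because the estimate needs only repeated calls to the model, the same helper functions would work unchanged with a full simulation in place of `toy_model`, at the cost of two model runs per sensitivity value.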


General Strategy 


In the context of these objectives, the general strategy took 
the form of the following seven steps: 


Instructor's Role 


1. Decide on a suitable special interest architectural topic 
for in-depth exploration during the quarter. 


2. Provide a specific description of model inputs and outputs. 


3. Provide a loose description of the model to be implemented, 
together with details of configuration, component per- 
formances, and component prices. 


4. Provide background information directly relevant to the
selected topic in whatever form it may be available 
(e.g., papers, lectures, movies, etc.). 


5. Provide loose guidelines regarding the desired form and 
coverage of the report to be prepared describing the 
exercise. 


Student's Role 


6. Write simulation program implementing the model in the 
language of his choice (usually Fortran) to be run on the 
computer of his choice, and exercise the model with 
varying component parameters and varying workloads. 


7. Prepare a report describing the simulation experiment, 
interpreting the results, and critiquing the experiment, 
including recommendations for its improvement.


In the case of both exercises, as conducted at Wright State
University, the main criteria for the instructor's decision in
step 1 were significance and timeliness of the topic in the light
of contemporary computer architecture developments, and
compatibility of the topic with the case studies to be conducted
concurrently. The former criterion tended to guarantee relevance
and an ample supply of current background material;
thus it eased execution of step 4, while at the same time
increasing student interest. In step 2, it was necessary to
specify the interface between model and its inputs and outputs
early and quite precisely, so that students could get on with
the matter of model building as soon as possible in the quarter.
The model descriptions given in step 3 were quite loose,
thereby relieving the instructor of a potentially heavy burden
while at the same time increasing student benefits from the
exercise by requiring them to do their own analyses of fairly
complex problems. The report guidelines of step 5 were
necessary because of the low average level of student experience
in architectural analysis, but the guidelines were kept as
loose as possible to encourage independent thought about the
simulation experiments themselves.


Exercise Costs 


Each exercise was presented and explained to students on a 
piecemeal basis rather than all at once, i.e., the various 
necessary descriptions were introduced by portions of lectures, 
handouts, etc., over the 10-week course period. The cost to the 
instructor of running the exercises was about 20 percent of avail- 
able lecture time, plus the time required to develop and pub- 
lish descriptive materials in the form of handouts (examples of 
the latter are given in (2)).


Other costs of running the exercises are the amount of student 
time required, and the amount of computer resources used. 
These costs, of course, vary widely, depending greatly on
the student's modeling and programming skills. One of the better
students in the class gave the following estimates of these costs 
(programs in Fortran, run on a PDP-10): 


                             Programming Time    Program Run Time

Virtual Memory Exercise      10-12 hours         7 minutes
Multiprogramming Exercise    10-12 hours         15 minutes


However, the above figures are for two of the most sophisti- 
cated models developed, and so the run time figures are not 
typical of the class as a whole. The following program run time 
figures are more typical of student models: 


                             Program Run Time

Virtual Memory Exercise      15 minutes on IBM 1130
Multiprogramming Exercise    5 minutes on IBM 360/65


III. VIRTUAL MEMORY SIMULATION


The virtual memory exercise was described to students in a 
series of partial lectures, a movie and handouts, with the latter 
containing configuration, performance and price data and the 
former serving to explain the virtual memory concept and the 
associated handouts. Because the IBM 1130 computer system 
was the only computer system with which all students in the 
class were reasonably familiar, and because the question of 
applicability of the virtual memory concept to small-scale
computer systems is an interesting one, the exercise took the
form of a feasibility study for adding a virtual memory to a 
small IBM 1130 system. 


Specific Objectives and Strategy 


Specific objectives can be obtained from the general by
assigning particular meanings to the performance figures of
merit, etc., given in Section II, equations 1-9. This is done
next. A fixed-page-size type of virtual memory system is
assumed.

$Y$ = number of memory accesses per second
$X_1$ = page replacement algorithm
$X_2$ = page size (words)
$X_3$ = real memory size (words)
$W$ = address stream
$P$ = system price
$Y = F(X_1, X_2, X_3, W)$
$F$ = mathematical model of the virtual memory system
      for these purposes


Essentially, in this exercise, it is assumed that an 1130 user
has a choice between an 1130 system with a real memory of
64K 16-bit words, and a Virtual Memory (VM) 1130 system
with a virtual memory of 64K words and a real memory of less
than 64K words plus an address mapping device, called a VM
box, with a certain assumed price and performance. The price
and performance data used were taken from IBM 1130 price lists
and manuals for standard system components. All price data
were reduced to equivalent dollars per month, and in cases
where only an actual or estimated selling price was known,
an estimated dollars-per-month figure was obtained by dividing
the selling price by 40. In cases such as that of the VM box,
an estimate of cost and performance was made and the selling
price was obtained by assuming a 4:1 ratio of selling price to
cost, this being quite typical in the computer industry. Details
such as this provided an opportunity for the instructor to point
out to the student the important distinction between price and
cost in computer architecture studies. As a result of the above,
it was estimated that a 4K VM-1130 system could reasonably
rent for about $2500/month, while a 64K non-VM 1130 could
reasonably rent for about $5300/month, with both offering the
user 64K words of addressable main memory. Furthermore, it
was estimated that the VM-1130 system could have additional
4K-word increments of real memory for about $205/month each.
Thus, the stage was set for examining the performance and
price/performance consequences of the VM-1130 with the aid
of simulation.
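

The price arithmetic above is simple enough to tabulate directly. The following Python sketch uses only the figures quoted in this section (selling price divided by 40, a 4:1 price-to-cost ratio, $2500/month for the 4K VM-1130, $205/month per additional 4K words, and $5300/month for the 64K non-VM 1130). The function names are hypothetical, and the assumption that increments are added in whole 4K steps is an interpretation of the text.

```python
MONTHS_PER_SELLING_PRICE = 40   # monthly figure = selling price / 40
PRICE_TO_COST_RATIO = 4         # assumed selling price : cost = 4 : 1

VM_BASE_RENT = 2500             # $/month, VM-1130 with 4K words of real memory
INCREMENT_RENT = 205            # $/month per additional 4K words of real memory
NON_VM_64K_RENT = 5300          # $/month, 64K non-VM 1130


def monthly_rent_from_cost(cost_dollars):
    """Estimated monthly rent for a component whose cost has been estimated."""
    selling_price = cost_dollars * PRICE_TO_COST_RATIO
    return selling_price / MONTHS_PER_SELLING_PRICE


def vm1130_rent(real_memory_k):
    """Estimated monthly rent of a VM-1130 with the given real memory (K words)."""
    if real_memory_k < 4 or real_memory_k % 4 != 0:
        raise ValueError("real memory must be a multiple of 4K words, at least 4K")
    extra_increments = real_memory_k // 4 - 1
    return VM_BASE_RENT + INCREMENT_RENT * extra_increments


if __name__ == "__main__":
    for size_k in (4, 8, 16, 24, 32):
        rent = vm1130_rent(size_k)
        print(size_k, rent, NON_VM_64K_RENT - rent)
```

Dividing a simulated figure of merit by `vm1130_rent(...)` for the same real memory size gives the performance/price ratio of equation (6) for that configuration.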


Model Inputs and Outputs 


The particular inputs and outputs are loosely defined in the previous
section. More specifics of these were described in
student handouts, with the understanding that students could
deviate from these specifics if such deviations seemed desirable
on the basis of preliminary experiment results. Students were
advised to use the following parameter values initially in
exercising the model:


$X_1$ (page replacement algorithm): FIFO, LRU, PA*
$X_2$ (page size, words): 16, 32, 64, 128, 256, 512
$X_3$ (real memory size, words): 4K, 8K, 16K, 32K
$W$ (address stream): AG-1, AG-2, AG-3A


* FIFO = First In First Out, LRU = Least Recently Used, PA = Push Alter.


Most of these inputs are self-explanatory, except that the PA
page replacement algorithm is not as widely known as the other
two, and the synthetic address stream (W) was invented for
purposes of this exercise. The PA algorithm (3) is essentially
one wherein information is saved as to whether or not a page in
main memory has been altered in the course of references made to it.
If a particular page has not been altered, then when that page
is ousted from main memory it is unnecessary to write it back to
the disk supporting the virtual memory scheme because that disk
already contains an identical copy of the page in question.
Thus, the saving of this type of page status information can
lead to a substantial reduction in the number of disk accesses
required, and therefore increases virtual memory performance.
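

The write-back decision described above can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm of reference (3): the `Frame` class and function names are hypothetical, and it is assumed here that a scheme without the altered-page status information must write every ousted page back to disk.

```python
class Frame:
    """One real-memory page frame with its page status information."""

    def __init__(self, page):
        self.page = page
        self.altered = False   # set when a reference modifies the page

    def reference(self, is_write):
        """Record a reference; a write marks the page as altered."""
        if is_write:
            self.altered = True


def page_transfers_to_replace(frame, push_alter):
    """Disk page transfers needed to replace the page held in frame.

    One transfer always fetches the new page. The ousted page is written
    back only if it must be: always when no status is kept, and only if
    the page was altered when the PA idea is used.
    """
    write_back = frame.altered if push_alter else True
    return 1 + (1 if write_back else 0)


if __name__ == "__main__":
    frame = Frame(page=3)
    frame.reference(is_write=False)
    print(page_transfers_to_replace(frame, push_alter=True))
    print(page_transfers_to_replace(frame, push_alter=False))
```

For a page that was only read, the PA variant needs a single disk transfer where the status-free variant needs two, which is the source of the reduction in disk accesses mentioned in the text.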


The virtual memory system model is driven by a synthetic 
address stream, which is essentially a sequence of numbers on 
the virtual memory address space (0,65535) obtained by suit- 
able modifications of a random number generator. Address 
generators AG-1, AG-2 and AG-3A are simple empirically 
derived address streams which are intended to be statistically 
similar to various possible actual address streams. Roughly 
speaking, they may be characterized as a severe thrasher, a 
moderately severe thrasher, and a moderate thrasher, 
respectively. There is, of course, no implication that these
labels are valid relative to real-world "typical" address
streams, since it is not known to this writer what these
real-world entities actually are. The generalized algorithm for all
address generators is given below:


Generalized Address Generation Algorithm


Step 1. Define address space = S = (0, MAX)

Step 2. Select starting address randomly on S = $A_1$

Step 3. Given address $A_i$, select $A_{i+1}$ as follows:

        • with 50 percent probability, $A_{i+1} = A_i + 1$
        • with 25 percent probability, $A_{i+1} = f_1(0, A_i)$
        • with 25 percent probability, $A_{i+1} = f_2(A_i, \mathrm{MAX})$

Step 4. Return to Step 3.


In the above, $f_1$ and $f_2$ are functions which map the address
domain into itself. The various address generators differ in the
particular form of the functions $f_i$.
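

The generalized algorithm can be sketched as a Python generator. The forms of $f_1$ and $f_2$ used for AG-1, AG-2 and AG-3A are not given in this paper, so the sketch assumes, purely for illustration, that both pick a uniformly random address in their range (`uniform_jump` is a hypothetical helper). Wrapping to address 0 when $A_i + 1$ would exceed MAX is also an assumption.

```python
import random


def uniform_jump(low, high, rng):
    """Assumed form for f1 and f2: a uniformly random address in [low, high]."""
    return rng.randint(low, high)


def address_stream(count, max_addr=65535, f1=uniform_jump, f2=uniform_jump, rng=None):
    """Yield `count` synthetic addresses on (0, max_addr)."""
    rng = rng or random.Random()
    addr = rng.randint(0, max_addr)              # Step 2: random starting address
    for _ in range(count):
        yield addr
        draw = rng.random()                      # Step 3: choose the next address
        if draw < 0.50:
            addr = (addr + 1) % (max_addr + 1)   # sequential step (wrap is assumed)
        elif draw < 0.75:
            addr = f1(0, addr, rng)              # jump into (0, A_i)
        else:
            addr = f2(addr, max_addr, rng)       # jump into (A_i, MAX)


if __name__ == "__main__":
    for address in address_stream(10, rng=random.Random(1)):
        print(address)
```

Passing different `f1` and `f2` functions, for example ones that favor short jumps, is one way a less severe thrasher could be produced from the same skeleton.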


* In the use of PA (Push Alter), it is assumed that 90 percent of the time a page is not altered when referenced. See (3)
for further details.


Model Description 


The model used to simulate the virtual memory system is quite
simple, and is given in Figure 1. The numbers used in the
bottoms of the flow chart boxes indicate the time required for
the system being simulated to execute the indicated step. It is
assumed that the VM box can determine whether or not the
desired page is in memory in the time of 1 µs, and that it can
execute the desired page replacement algorithm and perform
housekeeping (update page tables, etc.) in an additional 1 µs.
The value of the PA algorithm is clearly revealed by this model,
since, in this case, block 4 can frequently be bypassed.


[Figure 1 is a flow chart. Recoverable boxes, timings and notes:]

ADDRESS GENERATOR
DETERMINE PAGE TO BE OUSTED AND UPDATE PAGE TABLES: 1 µSEC
FETCH A: 3.6 µSEC
STORE OUSTED PAGE ON DISK (when required): PT
FETCH NEW PAGE FROM DISK: PT

NOTES:
(1) "A" MEANS DESIRED VIRTUAL ADDRESS IN (0,65535)
(2) "PT" MEANS AVERAGE TIME TO READ OR WRITE A PAGE
    = (132.5 + .03P) mSEC, WHERE P = PAGE SIZE IN NUMBER
    OF 16 BIT WORDS


Figure 1. Virtual Memory System Model
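

A model of this kind fits in a short program. The Python sketch below is a hedged reconstruction from the timings stated in the text and in Figure 1, not any student's program (those were mostly written in Fortran). It assumes that the 1 µs housekeeping step is charged only on a page fault, that without PA every ousted page is written back, and that each reference alters the page with probability 0.1, following the 90 percent assumption in the PA footnote. The function and parameter names are hypothetical.

```python
import random
from collections import OrderedDict

LOOKUP_US = 1.0        # VM box decides whether the page is in memory
HOUSEKEEPING_US = 1.0  # choose page to oust, update page tables
FETCH_US = 3.6         # fetch the addressed word


def page_transfer_us(page_words):
    """PT = (132.5 + .03P) msec, returned in microseconds."""
    return (132.5 + 0.03 * page_words) * 1000.0


def simulate(addresses, page_words, real_words, policy="LRU",
             push_alter=False, p_alter=0.1, rng=None):
    """Return memory accesses per second for the given address stream."""
    rng = rng or random.Random(0)
    frames = real_words // page_words
    if frames < 1:
        raise ValueError("real memory must hold at least one page")
    resident = OrderedDict()   # page -> altered flag; first entry is ousted next
    elapsed_us = 0.0
    accesses = 0
    for addr in addresses:
        accesses += 1
        page = addr // page_words
        elapsed_us += LOOKUP_US
        if page in resident:
            if policy == "LRU":
                resident.move_to_end(page)       # mark as most recently used
        else:
            elapsed_us += HOUSEKEEPING_US
            if len(resident) >= frames:
                _, altered = resident.popitem(last=False)
                if altered or not push_alter:    # write back unless PA says clean
                    elapsed_us += page_transfer_us(page_words)
            elapsed_us += page_transfer_us(page_words)
            resident[page] = False
        if rng.random() < p_alter:
            resident[page] = True                # this reference altered the page
        elapsed_us += FETCH_US
    return accesses / (elapsed_us * 1e-6)


if __name__ == "__main__":
    rng = random.Random(1)
    stream = [rng.randrange(65536) for _ in range(2000)]
    for policy in ("FIFO", "LRU"):
        print(policy, simulate(stream, page_words=64, real_words=4096, policy=policy))
```

With `policy="FIFO"` a hit does not reorder the resident pages, so the oldest loaded page is ousted; with `policy="LRU"` each hit moves the page to the back of the queue. Because a page transfer costs over 130 msec against a few microseconds for a hit, the result is dominated by the fault rate.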


Results and Discussion 


Some simulator outputs for varying core sizes, page sizes, page 
replacement algorithms and address generators (program envi- 
ronments) are given in Table 1.* The virtual memory figure of 
merit was taken as the number of main memory accesses per second.
The LRUA algorithm indicated in Table 1 was invented by the 
student whose results these are, and it uses a combination of 
the basic concepts of the LRU and the PA algorithms. 


For scientific research purposes, the experiment, itself, clearly 
needs much more refinement. Nevertheless, these crude results 
are both interesting and provocative. Consider, for example, 
the following sample observations: 


• Performance varies widely as simulation parameters
are varied, from a low of about 5 to a high of about
313 memory accesses per second.


• The ATLAS algorithm is consistently one of the worst of
the five algorithms examined.


• The more random the address stream, the smaller the
page size should be unless the ratio of real memory
size to virtual memory size is large (say, >0.5).


• Of all the algorithms, the ATLAS algorithm seems
least sensitive to page size.


The provocative nature of the results is illustrated by the fact 
that students tended to do much more with the simulation than 
was required. For example, in Table 1, the data for LRU, 
FIFO, and PA algorithms was required, but the LRUA and the 
ATLAS algorithms were investigated on the student's own 
initiative. 


In conclusion, the virtual memory simulation exercise
accomplished three main objectives, namely: 1) complement the case
studies approach to computer architecture; 2) improve student 
understanding of the virtual memory concept; and 3) improve 
student feel for tradeoffs associated with implementation of 
that concept. 


IV. SUMMARY AND CONCLUSIONS 


A general strategy and objectives for simulation exercises useful
in computer architecture education have been described.
Some costs associated with implementation of two such exercises 
have been discussed, and one of these exercises has been 
described in some detail. That exercise concerned the virtual 
memory concept, and it was found suitable for use as a quarter 
project for senior or first-year graduate level students. It was 
designed to complement a case studies approach to computer 
architecture education. 


Student feedback from the exercise indicated that it resulted in 
an improved grasp of the associated concepts, and an increased 
interest in the course itself. Better initial definition of the
exercises and, in some cases, better student background in both 
computer architecture and simulation itself would probably have 
resulted in increased benefits to students, but to some extent 
the evolutionary nature of the exercise development increased 
student participation in the modeling process itself, and the 
benefits from this tend to offset the disadvantages of the
piecemeal, evolutionary approach actually used.


Table 1
Virtual Memory Simulation: Some Results*
Virtual Memory Figure of Merit (Memory References per Second)

CORE  PAGE |         AG-1**          |          AG-2           |          AG-3A
SIZE  SIZE | LRU FIFO   PA LRUA ATLAS | LRU FIFO   PA LRUA ATLAS | LRU FIFO   PA LRUA ATLAS

4K      16 |  24   24   25   48     9 |  22   21   23   44     8 |  11   11   20   23    11
        32 |  24   23   25   47     9 |  23   22   25   45     9 |  12   12   22   22    12
        64 |   8    8   25   15     9 |   9    9   26   17     9 |  13   13   22   22    13
       128 |   8    8   16   14     9 |   9    ?   19   16     9 |  15   15   25   25    14
       256 |   8    8   16   15     8 |   8    8   18   15     9 |  22   22   41   34    22
       512 |   8    8   16   14     8 |   9    9   18   15     3 |  30   29   57   42    29
      1024 |   8    7   14   13     7 |   8    8   17   15     8 |  33   32   63   44    31
      2048 |   6    6   11   11     6 |   ?    7   15   12     7 |  34   32   64   43    33
      4096 |   5    5    9    8     5 |   6    6   11   10     6 |  28   28   56   35    28

8K      16 |  28   28   27   56    10 |  26   25   26   51     9 |  13   13   25   27    14
        32 |  28   28   27    ?    10 |  28   27   28   55    10 |  14   14   27   28    15
        64 |  29   27   27   57    10 |  29   26   30   58    10 |  14   14   25   28    16
       128 |   7    8   26   16    10 |   ?   15   30   20    10 |  16   16   27   29    18
       256 |   9    9    ?   16     9 |  10   10   21   18    11 |  25   25   46   40    26
       512 |   9   10   18   16     9 |  11   10   21   18    11 |  33   31   64   50    35
      1024 |   9    9   16   15     9 |  10    9   20   16     9 |  41   40   73   58    37
      2048 |   8    7   13   13     7 |   8    8   17   14     8 |  43   43   86   56    38
      4096 |   5    6   12   10     5 |   7    7   15   11     6 |  54   34   67   43    34

16K     16 |  44   40   36   87    16 |  42   40   39   84    15 |  17   17   31   34    17
        32 |  41   40   36   81    15 |  42   41   39   83    15 |  20   20   34   40    19
        64 |  41   40   35   83    15 |  44   41   38   89    15 |  21   21   31   43    20
       128 |  42   37   33   83    14 |  43   38   39   85    15 |  25   24   33   50    22
       256 |  13   31   34   23    13 |  42   34   38   85    15 |  33   35   55   64    32
       512 |  14   12   33   25    13 |  17   15   44   28    16 |  47   44   87   80    43
      1024 |  13   13   24   22    12 |  15   14   36   25    15 |  59   57  113   88    48
      2048 |  10   11   19   17    10 |  12   13   26   19    12 |  61   68  117   83    54
      4096 |   8    8   16   13     7 |  10   10   24   16    10 |  52   44   98   61    44

24K     16 |  75    ?   55  150    27 | 157  134  107  313    55 |  28   28   48   55    27
        32 |  72   67   55  144    26 | 144  125  107  288    55 |  33   33   53   66    30
        64 |  74   66   55  149    26 | 143  109  106  286    55 |  30   30   46   61    30
       128 |  73   63   52  147    25 | 115   96  105    ?    54 |  41   40   46   81    33
       256 |  71   58   53  143    23 | 127   71  102  254    56 |  51   50   78  102    54
       512 |  80   50   57  161    20 | 113   65  154  225    68 |  80   77  113  161    60
      1024 |  28   29   42   46    22 |  44   36  133   73    43 | 109   96  161  204    73
      2048 |   7    ?   37   29    18 |  38   25   59   53    34 |  86   86  215  136    81
      4096 |  20   16   32   30    20 |  17   22   41   25    21 |  89   98  163  115    70

* Address generators produced address streams on (0,32767)

** LRU = Least Recently Used; FIFO = First In First Out; PA = Push Alter;
LRUA = Least Recently Used Alter; ATLAS = Replacement Algorithm Used for Atlas



COMPUTER ARCHITECTURE COURSES IN
ELECTRICAL ENGINEERING
DEPARTMENTS

Department of Electrical Engineering
Houghton, Michigan


ABSTRACT 


This paper traces the history of computer architecture 
courses in electrical engineering departments. Pre- 
viously unpublished data from the Fall 1972 COSINE 
survey are given to show current computer architec- 
ture course offerings and texts. Computer architec- 
ture courses offered in 1972-73 are analyzed, com- 
pared with ACM and COSINE recommendations, and 
classified into five categories: introductory computer 
engineering courses with a computer architecture fla- 
vor, software-oriented computer organization courses, 
hardware-oriented computer organization courses, 
case study courses, and topical seminars. Future
trends in computer architecture education are predicted.


INTRODUCTION 


Computer architecture (or computer organization) is in- 
tended in this paper to correspond to the definition 
given by Foster (Jan. 1973): 


Computer architecture embraces the art and
science of assembling logical elements into
a computing device. As normally conceived
of a computer architect accepts from a logical
designer units such as stacks, memory
blocks, and tape drives and puts them together
so that they form a computer and turns
this over to a systems programmer who then
constructs an operating system for the machine.


Computer architecture courses comprise a major and 
rapidly growing division of computer engineering 
courses taught in electrical engineering departments. 
The 1972 COSINE survey of U.S. and Canadian electrical
engineering departments (Sloan, Coates, and
McCluskey, 1973a and b) found that the two COSINE-recommended
computer architecture courses were
taught at more schools than any other COSINE-recommended
computer engineering courses except for introductory
programming, introductory switching theory
and logic, and numerical analysis. Ninety per cent of 
the responding schools taught both computer architec- 
ture courses, although nearly half taught one or both 
outside of electrical engineering. 


HISTORY OF COMPUTER ARCHITECTURE COURSES 


The early history of computer architecture courses is
difficult to trace. A survey by Cook in 1963 showed
that only two EE departments surveyed taught three or
more computer courses and only six taught two or
more in his sample of major engineering schools,
granting nearly half of all ECPD-accredited bachelor's
degrees. This survey and a perusal of major engineering
school catalogs from the late 1950s and early
1960s suggest that few EE departments offered separate
courses in computer architecture much before
1965.


Table I, adapted from the Fall 1972 COSINE survey, 
traces the adoption of the two COSINE-recommended 
computer architecture courses, called by COSINE 
Machine Structure and Machine Language Program- 
ming and Computer Organization; (descriptions of both 
courses appear in the appendix with descriptions of 
ACM-recommended computer architecture courses). 
Although COSINE had intended Machine Structure and 
Machine Language Programming to be prerequisite to 
Computer Organization (COSINE, Jan. 1970), EE de- 
partments were quicker to adopt Computer Organiza- 
tion and were more likely by a margin of nearly 20% to 
teach it rather than the software-oriented machine 
structure course. Machine Structure and Machine
Language Programming was taught in 13% of EE departments
before 1965 and is taught in 55% of EE departments
today; the corresponding figures for Computer
Organization are 18% before 1965 and 73% today.


TABLE I
ADOPTION OF COSINE-RECOMMENDED
COMPUTER ARCHITECTURE COURSES

                          Machine Structure
                          and Machine          Computer
                          Language             Organization
                          Programming

Not taught                      9.5%               4.9%
Taught outside EE              35.1               26.8
First taught in EE:
  Before 1965                  12.8               18.3
  1965-66 to 1968-69           20.9               20.7
  1969-70 to 1970-71           17.6               22.0
  Since 1970-71                 4.1               153



CURRENT COURSE OFFERINGS


The remainder of the data for this paper is drawn pri- 
marily from previously unpublished data from the Fall 
1972 COSINE survey. Of the 151 EE departments responding
to the survey (67.4% of the 224 U.S. and
Canadian departments polled), 126 departments (56.2%
of the departments polled) provided varying degrees of
information on their course offerings ranging from 
listing of titles or notation of texts to catalog descrip- 
tions and complete course outlines. These depart- 
ments gave 47 different titles for courses which they 
identified as computer organization courses. Compu- 
ter Organization, Computer Architecture, and Digital 
Computer Organization were the most popular titles, 
but misleading titles such as Programming Principles 
and Introduction to Information Structures and opaque 
titles such as Computer Engineering II were also used 
to designate computer architecture courses. More 
than one-third of the EE departments teach exactly 
two computer organization courses, and nearly one- 
tenth teach three or more. 


About one-third of the departments surveyed listed the 
texts they intended to use in 1972-73 in their computer 
organization courses. Their responses are shown in 
Tables II and III, showing texts for first computer or- 
ganization courses and for advanced computer organi- 
zation courses (i.e. any courses after the first), re- 
spectively. Booth (1971), Foster (1970), and Gschwind 
(1967) were the most frequently reported texts for first 
computer organization courses; Bell and Newell (1971) 
was by far the most frequently reported advanced text. 
Some overlapping of texts can be noted with five texts 
being used for courses at both levels. 


TABLE II
TEXTS FOR FIRST COMPUTER
ORGANIZATION COURSES

Bartee (1966)            Digital Computer Fundamentals
Beizer (1971)            The Architecture and Engineering
                         of Digital Computer Complexes
Bell and Newell (1971)   Computer Structures
Booth (1971)             Digital Networks and Computer
                         Systems
Chu (1962)               Digital Computer Design Fundamentals
Chu (1970)               Introduction to Computer Organization
Flores (1969)            Computer Organization
Flores (1965)            Computer Software
Foster (1970)            Computer Architecture
Gear (1969)              Computer Organization and
                         Programming
Gschwind (1967)          Design of Digital Computers
Hellerman (1967)         Digital Computer System Principles
Lewin (1972)             Theory and Design of Digital
                         Computers
Sobel (1970)             Introduction to Digital Computer
                         Design
Stone (1972)             Introduction to Computer Organization
                         and Data Structures
Ware (1963)              Digital Computer Technology and
                         Design
Wiener (undated)         The Human Use of Human Beings
Assorted computer manuals


TABLE III
TEXTS FOR ADVANCED COMPUTER
ORGANIZATION COURSES

Bell and Newell (1971)   Computer Structures
Chu (1972)               Computer Organization and
                         Microprogramming
Flores (1963)            The Logic of Computer Arithmetic
Foster (1970)            Computer Architecture
Gear (1969)              Computer Organization and
                         Programming
Gschwind (1967)          Design of Digital Computers
Hellerman (1967)         Digital Computer System Principles
Husson (1970)            Microprogramming Principles
                         and Practice
Peatman (1972)           The Design of Digital Systems
Assorted computer manuals


RECOMMENDATIONS FOR COMPUTER 
ORGANIZATION COURSES 


Two major national groups, the ACM Curriculum Committee
on Computer Science and the COSINE Committee,
have made recommendations for computer organization
courses. A comparison of their courses, described
in the appendix, shows much similarity and
serves as a basis for considering courses actually
taught in EE departments. COSINE's Machine Structure
and Machine Language Programming resembles
ACM's B2, Computers and Programming, while
COSINE's Computer Organization corresponds roughly
to ACM's I3, Computer Organization. ACM also recommends
an advanced course, A2, Advanced Computer
Organization.


ACM and COSINE probably never expected that depart- 
ments would pattern courses exactly after their recom- 
mendations. COSINE (Jan. 1970) noted that they were 
more concerned with recommending topics which should 
be treated somewhere in the curriculum than they were 
with packaging topics into courses suitable for all 
schools. An ACM survey (Engel, 1971) of 26
doctorate-granting computer science departments showed
that only 5 offered computer organization courses as 
specified by ACM while 17 offered similar courses. 


CLASSIFICATION OF COMPUTER 
ORGANIZATION COURSES 


The courses actually being taught in EE departments 
in 1972-73 differed from the recommendations in their 
diversity and appeared to cluster roughly into five 
categories: 


1. introductory computer engineering courses with
a computer architecture flavor;
2. software-oriented computer organization courses; 
3. hardware-oriented computer organization courses; 
4. case study courses; and 
5. topical seminars. 


Introductory computer engineering courses with a computer
architecture flavor appear to be emerging rapidly
to meet a need apparently not foreseen by either ACM
or COSINE. The increasing importance of digital 
technology has made it desirable for all EE under- 
graduates to have a course in digital computers beyond 
the usual FORTRAN or other first programming 
course. Survey courses, combining switching theory, 
machine language programming, computer organiza- 
tion, and sometimes other topics, are being developed, 
usually for sophomores and often as required courses. 
These courses serve to introduce computer engineer- 
ing to prospective specialists as well as to overview 
the area for other engineering students. They can also 
serve as an introductory hardware course for compu- 
ter science students. A suitable text is Booth (1971). 


Software-oriented computer organization courses correspond
roughly to COSINE's Machine Structure and
Machine Language Programming and to ACM's B2.
The software emphasis makes the course relatively 
more likely to be taught in computer science depart- 
ments than the other types of computer organization 
courses. This course is taught most frequently at the 
sophomore or junior level and usually includes sub- 
stantial hands-on programming experience with a 
minicomputer. Gear (1969) and Stone (1972) are typi- 
cal texts. 


Hardware-oriented computer organization courses 
comprise the most commonly taught group of compu- 
ter organization courses. They are taught at both the 
introductory and advanced levels. The introductory 
courses correspond roughly to COSINE's and ACM's 
Computer Organization and are usually taught to juniors 
and seniors. Their approach to computer organiza- 
tion is more passive than the advanced courses; em- 
phasis is placed on the student's understanding of the 
way a computer operates rather than on his prepara- 
tion for designing computers. The courses may or 
may not include switching theory and logic design; 
(ACM recommended including logic design topics while 
COSINE recommended that logic design be a prerequi- 
site to the course). The course is often accompanied 
by a laboratory, especially when logic design is included.
Chu (1970) and Gschwind (1967) are suited
for introductory hardware-oriented computer organization
courses.


Advanced hardware-oriented computer organization 
courses tend to have a greater design flavor and to 
overlap with the material of courses usually called 
digital systems design. The courses frequently cul- 
minate in student design, usually just a paper design, 
of some digital system, such as a simple minicompu- 
ter. Laboratories and computer simulation languages 
may or may not be included. Foster (1970), Peatman 
(1972), and Hill and Peterson (1973) may be used, but 
the most advanced courses are usually taught from 
notes or the literature. 


Case study computer organization courses are usually 
preceded by one or more other computer organization 
courses and concentrate on comparing organizations 
of several computers. The text, if one is used, is in- 
variably Bell and Newell (1971), but frequently the 
course is taught from computer manuals and journal 
articles. The study is usually primarily descriptive,
although Bell and Newell have contributed to the
conceptualization of the subject.


Topical seminars are the least common courses but
may become increasingly important as computer
architecture education continues to expand. These
advanced seminars usually center on one topic, such
as microprogramming or memory organization, for
a term or more and are based on discussion of papers
from the literature or ongoing research.


TRENDS 


Trends for the future of computer architecture educa- 
tion are hard to predict because the subject depends 
so heavily on changes in technology. At least for the 
near future a continued growth of introductory courses 
seems assured as more engineers, both electrical and 
other, will need to understand computer organization 
and machine language programming in order to imple- 
ment increased numbers of digital systems. Few en- 
gineers taking computer organization courses are 
likely to design computers; hence greater emphasis 
on interfacing and computer evaluation is needed. 
Perhaps in the long term the architecture of such sys- 
tems will have changed to allow implementation by 
less knowledgeable engineers. 


Growth of advanced courses also seems likely as the 
technology continues to proliferate. As computer 
architecture matures, the courses will become less 
descriptive and more conceptual. The decreasing ex- 
pense and increasing applications of digital components 
will combine to promote greater emphasis on simpli- 
fied interfaces with more awareness of users' needs 
so that, for example, study of computer-assisted in- 
structional systems will emphasize the learner's 
needs resulting in faster response time, better de- 
signed terminals, etc. at the expense of optimal use 
of components.


APPENDIX
ACM 65 - Required Basic Course 
2. Computer Organization and Programming 


Prerequisite: Course 1 above 

Logical basis of computer structure, machine 
representation of numbers and characters, 

flow of control, instruction codes, arithmetic 
and logical operations, indexing and indirect 
addressing, input-output, subroutines, linkages, 
macros, interpretive and assembly systems, 
pushdown stacks, and recent advances in com- 
puter organization. Several computer projects 
to illustrate basic concepts will be incorporated. 
(ACM, Sept. 1965) 


Courses 


Machine Structure and Machine Language 
Programming 


Content. Computer organization model for 
interpreting a machine language, machine 
representation of data and instructions, 
programming in assembly language, I/O 
processes, equipment interrupts, stacks, 
and multiprogramming. 


Computer Organization 


Content. Elements of a stored program 
computer, data representation, algorithms 
for operating on data, arithmetic units, 
control units, memory units, processor 
structures, and selected computer exam- 
ples. (COSINE, Jan. 1970) 


ACM 68 - Computer Architecture Courses 


Course B2. Computers and Programming 


Prerequisite: Course B1

Computer structure, machine language, in- 
struction execution, addressing techniques, 
and digital representation of data. Computer 
systems organization, logic design, micro- 
programming, and interpreters. Symbolic 
coding and assembly systems, macro definition
and generation, and program segmentation
and linkage. Systems and utility programs,
programming techniques, and recent
developments in computing. Several compu- 
ter projects to illustrate basic machine struc- 
ture and programming techniques. 


Course I3. Computer Organization 


Prerequisites: Courses B2 and B3 

Basic digital circuits, Boolean algebra and 
combinational logic, data representation 
and transfer, and digital arithmetic. Digi- 
tal storage and accessing, control functions, 
input-output facilities, system organization, 
and reliability. Description and simulation 
techniques. Features needed for multi- 
programming, multiprocessing, and real- 
time systems. Other advanced topics and 
alternate organizations. 

Course A2. Advanced Computer Organization 
Prerequisites: Courses I3, I4 (desirable),
and I6 (desirable) 


Computer system design problems such as 
arithmetic and nonarithmetic processing, 
memory utilization, storage management, 
addressing, control and input-output. Com- 
parison of specific examples of various 
solutions to computer system design prob- 
lems. Selected topics on novel computer 


organizations such as those of array or cellular
computers and variable structure computers.
(ACM, March 1968)


INCREASING HARDWARE COMPLEXITY—
ARCHITECTURE EDUCATION
Karlsruhe, Germany


ABSTRACT 


The paper starts with a survey of the history and the
present-day situation of educational concepts and design
methods in computer architecture. Complexity problems, bad
design habits, cooperation problems between specialists,
and their changing ranges of responsibility are covered,
and the consequences of the developmental trends are
discussed: it is now time to switch over to an integrated
teaching of hardware/software design methods. The HIM
scheme (hierarchy of interpretive modules) is suggested as
a conceptual machine organization framework for modelling
the implementation of language hierarchies. The application
of the HIM scheme to a better understanding of semantics
and to the derivation of design guidelines is discussed.


I. INTRODUCTION 


SOFTWARE COMPLEXITY 


About 5 years ago the slogan "software crisis" was
coined for a wide variety of problems caused by
increasing complexity and by a lack of design methodology.
The most important suggestions for meeting those problems
rely on restricting the freedom of design decisions so as
to avoid "tricky program structures". The slogan "ego-less"
programming has been coined (19) for the virtues of the
disciplined programmer, who is needed to meet the software
crisis. It becomes apparent that the software crisis is an
educational problem (e.g. see 16).


HARDWARE DESIGN PROBLEMS 


The Hardware/Software "Interface Crisis" 


The introduction of LSI and decreasing hardware cost
increasingly give reason to discuss the replacement of
pieces of software by hardware. With increased utilization
of LSI capabilities we are going to face similar complexity
problems in the hardware field too. One reason why the
computer architecture community is not yet clearly aware
of a "hardware crisis" is the existence of cooperation
problems: the lack of mutual understanding between hardware
men and software men (one might speak of an "interface
crisis", a subset of the hardware crisis).


Changing Ranges of Responsibility 


The analysis of changes in the partitioning of design
activities among specialists shows that every new
system generation brings us closer to settling the
hardware crisis and the interface crisis. Illustration 1
shows the methodological levels of system synthesis (see
"hierarchy of levels", chapter I in ref. 1), and the
movements of the fields of activities of components/chip
manufacturers (C), hardware designers (H), software
engineers (S), and language designers (L) in the course
of system generations. The H field of generation 1
covers levels 2 thru 7. Unsolved problems in level 2 keep
designers from spending much time on levels 3 thru 7.
In generation 2 the L field takes over level 7 and
the S field level 6. In generation 3 the S field takes
possession of level 5 via microprogramming (survey:
e.g. 12, 15), and the C field extends downwards to
level 2 by SSI chips. In generation 4 the C field will
take over level 3 via LSI technology, provided there is
reasonable chip family planning (pragmatic definition of
"reasonable": ref. 10).


Towards an Integration of Design Methods 


Design activities in a well formed level 4 mean the use
of high level hardware design languages. So design and
description methods without the use of any tools from
levels 3 thru 1 will be possible, provided a proper set of
register transfer primitives is available (9). This means
close similarity of formal tools between the S field and
the H field. So generation 4 is the opportunity for a
fusion of S field and H field to form an I field of
"integrated design methods" (see Illustration 1). This
would settle the interface crisis by integrated design
approaches such as "top-down design" (14) or "bottom-up
design", as practised in developing some high level
language machines (survey: in ref. 5).


Excellent teaching of integrated design requires a
combination of the software men's procedural way of thinking
and the hardware men's functional and black-box-oriented
way of thinking. Descriptional tools for such a
methodological combination would be high level
programming languages with hardware description features,
together with a "grey box" (14) block diagram language
(9,10). Illustration 2 shows an example of such a "grey
box" block diagram, equivalent to the register transfer
statement


on t do PA := if AE then PE2 else PE1;


where t is a clock signal and AE is a static condition.
Such a simultaneous use of a symbolic notation and a
grey box diagram in teaching would help to maintain awareness
of the duality of algorithms and their hardware carrier
structures. The grey box diagram in this case is not a
description of a particular hardware component; it is
a sort of low level extension of semantics, or something
like generalized pragmatics of a language.
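

The register transfer statement above can be mimicked in software to make the duality of notation and carrier structure concrete. The following is only a minimal sketch in Python: the class Registers and the function clock_edge are hypothetical names introduced here, and the registers are assumed to hold plain integers.

```python
from dataclasses import dataclass


@dataclass
class Registers:
    """Hypothetical register set; the names follow the statement above."""
    PA: int = 0
    PE1: int = 0
    PE2: int = 0


def clock_edge(regs: Registers, AE: bool) -> None:
    """Model one occurrence of clock signal t: PA := if AE then PE2 else PE1."""
    # The conditional expression plays the role of a two-way selector
    # controlled by the static condition AE; the assignment is the transfer.
    regs.PA = regs.PE2 if AE else regs.PE1


regs = Registers(PE1=5, PE2=9)
clock_edge(regs, AE=False)
pa_when_ae_false = regs.PA   # PE1 was selected
clock_edge(regs, AE=True)
pa_when_ae_true = regs.PA    # PE2 was selected
```

The conditional expression corresponds to the selector in the grey box diagram, and the call to clock_edge corresponds to one pulse of t.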


The "Hardware Crisis" 


One symptom of the "hardware crisis" (and of an
increasing awareness of it) is an arising discussion on
hardware design philosophies, accompanied by a
condemnation of a certain kind of traditional hardware
structure, which could be called "tricky hardware". ROSIN
identifies (but does not approve of) 4 "rules of thumb",
unconsciously used by many machine designers (17), and so
causing lots of trouble and inefficiency on the software
side:


Rule 1: "In case of doubt, sacrifice a design concept
to preserve cycle time". Rule 1 refers to an unreasonably
short-sighted MIPS-squeezing, regarding neither the
efficiency of the instruction set nor the hardware/software
cost ratio.


Rule 2: "Some facilities are cheap". Rule 2 refers to
hardware links not intended in the original plan,
introduced for adding "extra features with no extra cost".


Rule 3: "Design constraints don't allow the realization
of some otherwise good ideas". The design constraints
in rule 3 are those imposed by an unreasonable set of
preconceived components.


Rule 4: "If it looks nice, it must be beautiful". Rule
4 refers to features which are "monuments to the
cleverness of the designer" (17).


To meet these problems we need a strong influence on
the designing habits in the field of machine
organization, aiming at the design of "structured hardware"
instead of "tricky hardware". We need to promote
the virtues of "ego-less" hardware designing. The HIM
scheme presented in this paper, used as a design
guideline, will help in designing "structured hardware".


THE MODELLING OF DIGITAL PROCESSORS 


Sequential models 


The subject of this paper is based on the sequential
version of the information structure model of the
execution of programs (see chapter 4 in ref. 18), where
an information structure model is a triple $M = (J, J^0, F)$,
with the set J of information configurations (snapshots),
its subset $J^0 \subset J$ of initial information configurations,
and the set F of operators on J. The set J is subdivided
by $J = (C, P, D)$ into a control component C, a program
component P, and a data component D, according to
illustration 3. Instruction pointer ip and data pointer dp
scan P and D under control of C. The scheme in
illustration 3, not originating from hardware men, is
incomplete for architectural use, as it does not show the
embedding of F into the model.


ILLUSTRATION 2 


The embedding of F is performed by another model, not
originating from software men, shown in illustration 4
and described elsewhere (e.g. 20,21). The "controller"
K combines P and C from illustration 3. Block
F contains the resources for the implementation of the
set F of operations. F and D are connected by data paths
for the transfer of arguments and results. The "order
vector" Y is a selector word for selecting and
activating the particular subset of F required for the
actual step. Status vector X denotes the feedback from
F to C for decision purposes.


Automata-oriented Modelling: Some authors model K
separately (e.g. 3,7) and one models the combination of K
and C (20,21) by finite state machines. It has been
demonstrated (8) that these models may be extended
into a hierarchy by replacing the model of K by a finite
state transducer (see illustration 5). Thus we get a
hierarchical model, according to illustration 6, where
a machine $M_i$ receives instructions in its machine
language $L_i$. Controller $K_i$ inside $M_i$ translates
(on-line) $L_i$ into orders in language $L_{i-1}$, the machine
language of an inner machine $M_{i-1}$. This scheme may be
nested to form a hierarchy of machines: $M_i$ is the inner
machine of $M_{i+1}$. $M_{i-1}$ may have an inner machine
$M_{i-2}$ etc.


Programming Language-oriented Modelling: The hierarchy
of processes in a program-controlled digital processor
implements a hierarchy of languages (e.g. see ref. 13),
and does not primarily appear as a hierarchy of
automata. One important disadvantage of automata-oriented
modelling is that it fails to model the entry of
immediate data from instruction streams or statement
streams. (By the way: that is an important difference
between microprogram control and hardware control.)
What is needed is a model that is more
language-oriented. From a hardware point of view, this leads to
modelling in terms of interpretations, instead of state
transitions.


"Grey Box" Use for Integrated Modelling: The use of
"structured programming" techniques for the design language
description of computer architectures and machine
organizations makes flow charts superfluous. The space
freed by throwing out flow charts may be used for "grey
box" diagrams. In teaching we achieve by this a better
understanding of programming language principles and
their semantics, of hardware/software interface problems,
and of integrated design methods. The next section
of this paper suggests the HIM scheme as a "grey box"
modelling framework for these purposes.


II. INTERPRETIVE MODULES AS MODELS 
REFINEMENT OF AUTOMATA-ORIENTED MODEL 


A model, which may be regarded as a refinement or an
implementation of automata-oriented models, can be
derived from the fact that each program execution is
subdivided into cycles with the following 3 subcycles (13):


1. The fetch subcycle for the selection of the next
element $l_i \in L$ from the program store P, where L is
the language $L = \{l_1, l_2, \ldots, l_n\}$. This selection is
performed via adjustment of the instruction pointer ip
(see illustration 3).


2. The recognition subcycle, which performs a test
whether the selected element $P[ip_i]$ is a legal element
with $P[ip_i] \in L$, and which performs the recognition of
the specific $l_i \in L$ being represented by $P[ip_i]$, if
legal.


3. The execution subcycle, which performs the proper
semantic operations on the data structure D, as
resulting from the recognition subcycle.
In a particular level i within the hierarchy, the
subcycles 1 and 2 are performed by controller $K_i$, while
subcycle 3 is executed by the machine $M_{i-1}$. A refinement
of $K_i$ according to subcycles 1 and 2 is shown in
illustration 7: $P_i$ denotes the program store, containing
the interpreter $I_i$; $C_i$ denotes the control module,
responsible for the proper sequencing of the stream of
instruction words from $P_i$, entering $C_i$ via instruction
buffer $IB_i$. $R_i$ denotes the recognition device, having
the 2 submodules (not shown here) CL (classifier) and AL
(action lexicon). The output of $R_i$ (produced by AL) is
the order vector $Y_i$, evoking semantic actions to be
performed by machine $M_{i-1}$, and the control vector $Y_i'$,
evoking control actions to be performed by control module
$C_i$.


A second step of refinement is the subdivision of $M_{i-1}$
into the submodules $F_i$ (functions) and $D_i$ (direct data).
By both refinements we get a scheme as shown in
illustration 8. $F_i$ is the implementation of the available
set of operators on $D_i$. The direct data structure $D_i$ is
a set of registers for temporary storage. The controller
$K_i$, refined by this scheme, will be called "interpretive
module" (IM).


LINKING INTERPRETERS TOGETHER TO FORM A HIERARCHY


Vertical Linkage


Before the description of signal transfers between the
modules in illustration 8 is completed, the synthesis
of a hierarchy of such structures is demonstrated (see
illustration 9). The $I_i$ programs in $P_i$ and the modules
[...] form the semantic unit $F_{i+1}$ of the next higher
language level i+1. As shown in illustration 8, the module
$F_{i+1}$ receives orders via the transfer path $A_i$ from
module $R_{i+1}$ of the next higher level. After having
transmitted an order word via $A_i$ to $F_{i+1}$, the module
$C_{i+1}$, together with $P_{i+1}$ and $R_{i+1}$ and all modules
of all higher levels, remains in an inactive "waiting
state" until an "end-of-execution" message (end) is fed
back from $F_{i+1}$ via path $S_i$. Thus the two paths $A_i$ and
$S_i$ (see also illustration 5) are vertical links to the
next higher level of carrier hardware modules. On the
other side the two paths $Y_i$ and $X_i$ are links to the next
lower level modules, formed inside $F_i$ (also see
illustration 5). Thus a hardware-supported block structure
(or a grey box structure) of nested interpretive modules,
called the HIM scheme (hierarchy of interpretive modules),
is formed. The above linkage for communication between
language levels of different order is called "vertical
linkage" or "vertical subroutine linkage" (as this scheme
links the interpreters like a subroutine calling mechanism).


Connections Inside One Level


Inside one level of the HIM there are the following
transfer paths for interconnecting the modules (see
illustration 8). A call via path $A_i$ causes the transition
of $K_i$ to an initial state via adjustment of $IP_i$ to the
appropriate program entry point of $I_i$. Such a call is
evoked by the end message of the forerunner program in
$P_i$, recognized by $R_i$ as described above. The fetched
instruction $P_i[IP_i]$, buffered into $IB_i$, is analyzed by
$R_i$. The resulting order vector $Y_i$ is looked up from
$AL_i$ in $R_i$ and fed to $F_i$, evoking the activation of the
required subset of data paths in $F_i$ for the appropriate
transfers between registers in $D_i$ and from emit fields
$E_i$ (in $IB_i$) or $E_j$ (from higher levels, if implemented).


I/O to or from $D_i$ is activated in an indirect manner
by placing I/O control messages into special interface
registers in $D_i$, as for instance a "read"-bit
or a "write"-bit, when core storage is external, as
e.g. in the microprogram level. The control vector $Y_i'$,
derived by $R_i$ from $IB_i$, controls via decision logic
$DL_i$ (see illustration 10) inside $C_i$ the adjustment
of $IP_i$ for the next fetch cycle. The adjustment of
$IP_i$ is executed by the modify paths in MDY in the
feedback loop at $IP_i$. MDY contains adder, incrementer
etc. By path $X_i$ the following influences on the
operation of $C_i$ are implemented: the cycle time of
$C_i$ via a clock bit in $X_i$, the decision for the
adjustment of $IP_i$ by status bits in $X_i$, and the request
of the next call from the next higher level by an
end-bit in $Y_i'$.


ILLUSTRATION 9
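

The vertical linkage between two levels can be sketched as nested callable modules. This Python sketch is an assumption-laden illustration, not the author's model: the classes FunctionUnit and InterpretiveModule, the SWAP instruction, its three-step microprogram and the register names A, B, T are all hypothetical. The method call stands for path A and its return value "end" for path S.

```python
class FunctionUnit:
    """Innermost level: F and D only; executes register transfers directly."""

    def __init__(self, registers):
        self.D = registers                 # direct data: named registers

    def call(self, order):
        dst, src = order                   # order: (destination, source)
        self.D[dst] = self.D[src]          # gated transfer between registers
        return "end"                       # end-of-execution message


class InterpretiveModule:
    """One level: program store P, action lexicon AL (inside R), inner unit F."""

    def __init__(self, P, AL, inner):
        self.P = P                         # entry point -> list of instruction words
        self.AL = AL                       # instruction word -> order for the inner unit
        self.inner = inner                 # implementation of F at this level

    def call(self, order):
        """Path A: accept an order from the next higher level."""
        for word in self.P[order]:         # C sequences the words fetched from P
            if word not in self.AL:        # R: legality test
                raise ValueError(f"illegal word {word!r}")
            y = self.AL[word]              # R: look up the order vector Y
            if self.inner.call(y) != "end":    # this level waits for the inner unit
                raise RuntimeError("inner unit did not signal end")
        return "end"                       # path S: end message to the caller


D = {"A": 1, "B": 2, "T": 0}
micro = InterpretiveModule(
    P={"swap": ["T<-A", "A<-B", "B<-T"]},
    AL={"T<-A": ("T", "A"), "A<-B": ("A", "B"), "B<-T": ("B", "T")},
    inner=FunctionUnit(D),
)
machine = InterpretiveModule(P={"main": ["SWAP"]}, AL={"SWAP": "swap"}, inner=micro)
status = machine.call("main")
```

Because each level only sees the call interface of the level below, a further language level can be stacked on top by passing machine as the inner unit of another InterpretiveModule.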


Functional Partitioning of resources may be modelled
by splitting $Y_i$ up into subvectors $Y_i(1)$, $Y_i(2)$, ...
and transmitting them to separate submodules $F_i(1)$,
$F_i(2)$, ... of module $F_i$. Such a partitioning may be
modelled in any level. Modelling the submodules of
$F_i$ by the HIM scheme shows several $K_{i-1}$ modules and
module $K_i$ selectively calling one particular $K_{i-1}$
module per cycle.


Parallelism and its modelling by the HIM scheme is
not the subject of this paper. The modelling of
clocked parallelism is trivial and is included in
the HIM scheme automatically. The introduction of
synchronizing devices for asynchronous parallelism
into the model is possible and may be achieved
by modifying the C module.


Horizontal Linkage or "horizontal subroutine linkage"
is the name for a subroutine calling mechanism within
the same language level: the call of a subroutine
within the interpreter program $I_i$ (both the called
subroutine and the calling routine are within the
same module $P_i$).


THE SUBMODULES WITHIN ONE LEVEL 


The P module may be a simple program store, when all
parts of $I_i$ are always resident. But the P module
may be extended by mechanisms for "load and call" or
for "compile, load and call" of non-resident parts
of the interpreter $I_i$, and may include tables.


The C module may be a relatively simple structure, if
no subroutine techniques for $P_i$ are used. For
modelling subroutine techniques $IP_i$ is extended into a
pointer stack $IPS_i$, and in the case of an internal
subroutine call the request of an order via path $A_i$
is replaced by a push operation on $IPS_i$.
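

The extension of the instruction pointer into a pointer stack can be sketched as follows. This is a minimal Python illustration under stated assumptions: the instruction set (inc, call, ret, end), the flat program store and the function run are hypothetical and serve only to show the push on an internal subroutine call.

```python
def run(P, entry, data):
    """Run the program in store P from address entry, using a pointer stack."""
    ips = [entry]                      # pointer stack; its top is the current IP
    while True:
        op, *args = P[ips[-1]]         # fetch via the top of the pointer stack
        ips[-1] += 1                   # default adjustment: increment
        if op == "call":               # internal (horizontal) subroutine call:
            ips.append(args[0])        #   push the subroutine entry point
        elif op == "ret":
            ips.pop()                  # resume the calling routine
        elif op == "end":
            return data                # end message to the next higher level
        elif op == "inc":
            data["acc"] += 1           # hypothetical semantic operation
        else:
            raise ValueError(f"illegal instruction {op!r}")


P = [
    ("inc",),        # 0: main routine
    ("call", 4),     # 1
    ("call", 4),     # 2
    ("end",),        # 3
    ("inc",),        # 4: subroutine, same program store
    ("inc",),        # 5
    ("ret",),        # 6
]
result = run(P, 0, {"acc": 0})
```

Both the calling routine and the called subroutine reside in the same store P, which is what distinguishes this horizontal linkage from the vertical linkage between levels.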


In some cases "immediate orders" are entering $C_i$ via
$A_i$ (e.g. the module of the DEC PDP-8 machine,
during the processing of microcode from special
machine instructions). In this case $IB_i$ receives its
input directly from the $A_i$ input, and not from $P_i$.
The decision logic DL has to provide a tag sensing
feature for recognizing "immediate" orders entering via
$A_i$. If we extend $IB_i$ into a stack $IBS_i$ for pushing
immediate orders and use another stack in the $D_i$
module, we are able to model the direct evaluation of
expressions, such as illustrated by the shunt station
model.


The R-module may be a decoder network, when used for
modelling the decoding of a machine instruction or a
microinstruction. $R_i$ may be implemented in a more
sophisticated manner and sequentially, when constructs
of a higher level language with a wide variety of
sentence formats have to be analyzed. The decoding of
instructions of byte-oriented machines is an example
of decoding variable length objects in parallel. Ref. 9
demonstrates how the equivalence of sequential
networks and combinatorial networks can be
used for a uniform modelling of decoder networks and
a class of parsing algorithms in terms of a set of
register transfer primitives. This idea allows a uniform
modelling of a wide variety of R-modules from
different levels of a complex digital system. So the
AL-submodule of $R_i$ may be a table (in higher levels),
a wiring scheme (in lower levels) or even a model for
the pulse phase level below the register transfer level,
which is useful for pedagogic purposes and for the
analysis of the behavior of certain register transfer
structures without using formal tools of the level of
logic design (10).


The F-module is a combination of all transfer carriers
used for semantic purposes, such as gated transfer
paths, paths from and to registers for the implementation
of register assignment operations, and transformational
transfer paths, such as arithmetic and logic
units, when it is used for modelling $F_i$ in the
microprogram level. In higher levels the F-module appears
as a combination of abstract transfer carriers, yielded
by omission of intermediate transfer steps implemented
in lower levels.


The D-module is the set of all data containers, such as
read/write registers and read-only registers (constants
or input terminals from emit fields), which are directly
or implicitly addressable by constructs of the
language used for $I_i$. Those data containers which
are addressable by the language of $I_i$ only indirectly are
parts of the external data structures, called I/O in
illustration 9. Those data containers are directly or
implicitly addressable only within one of the higher
levels of the hierarchy. Those registers which are not
at all addressable by the language of $I_i$ may be
addressable within lower levels of the hierarchy, not
modelled in the present level. Sometimes there are
registers which belong to 2 (or more?) levels' D-modules
simultaneously, as for instance the accumulator register
or sometimes other general registers, addressable by
$L_\mu$ and by $L_m$ and thus belonging to $D_\mu$ and $D_m$
simultaneously (see illustration 10). In such a case $D_\mu$
and $D_m$ are overlapping, as e.g. shown in illustrations 9
and 10 (here m stands for "machine language" and $\mu$ for
"micro language").


III. POSSIBLE APPLICATIONS OF THE HIM SCHEME 


MODELLING EXISTING SYSTEMS 


Illustration 10 demonstrates the pedagogic use of the
model for structuring the register transfer carriers of a
microprogrammed instruction set processor by modelling
it as a two-level scheme. The HIM scheme is also useful
for modelling more than two levels, and for
modelling higher language levels. The embedding of
compiler and assembler software into the framework of the
HIM scheme will be possible by using 2 different schemes:
one scheme for compile time modelling, and one scheme
for run time modelling.


Classification of Architectures in Terms of the HIM 


All digital systems are implementations of language
hierarchies. A very useful criterion for classifying
computer architectures is the degree of directness of
the hierarchy implementation. It is more direct in pure
interpretive systems than it is in compiler-oriented
systems. The more language levels have C modules and
R modules implemented in hardware instead of software,
the more direct it is. Virtual (software-implemented)
C and R modules require a time-sharing of
lower levels' C and R modules, and in many cases of F
and D modules too. In conventional instruction set
processors, for instance, the higher language levels share
the interpretive module of the machine language level.
Let me classify this as a "tricky" implementation of
hierarchy and call it "vertical multiple use". It seems
practical, however, not to classify "horizontal multiple
use" (subroutine mechanisms within the same language
level) as "tricky".


Let me give further definitions of hardware structures.
"Structured hardware" is the basis for a direct
implementation of a hierarchy, and an indirect
implementation of a hierarchy of languages results in hardware
which is less "structured". Another criterion of the
directness of an implementation is the existence of
vertical multiple use of working stores and their
accessing hardware for P and D modules. At least in this
respect the "stored-logic machine" and the "von NEUMANN
machine" show a considerable degree of indirectness in
hierarchy implementation.


Trends towards Structured Hardware 


In the 1970s there is a tendency to transfer more
and more complexity from the software part to the
hardware part of systems. This is demonstrated by the
successive advent of the following classes of architecture
(5):


1. von NEUMANN-type architecture


2. syntax-oriented architecture (e.g. the B 5500)


3. IHLL-architecture (indirect high-level language,
survey: ref. 5, also see ref. 11)


4. DHLL-architecture (direct HLL (5, also: 2,6))


This sequence of architectures demonstrates the
following developmental trends in system concepts
research:


1. Increasing directness in implementing
language hierarchies, which means
increasing similarity to the HIM scheme.


2. The increasing complexity of hardware
makes it more and more advisable not
to produce tricky hardware, but to
produce structured hardware instead.


3. Because of low hardware cost, the
extremely efficient utilization of
particular hardware submodules is no longer
a relevant design objective. We now can
afford idling modules. This leads to
more hardware instead of tricky hardware.


4. A growing tendency to functional
partitioning of hardware (e.g. ref. 5) yields
a tendency to more structured hardware.


These developments make systems more and more
appropriate for being modelled by the HIM scheme, and for the
use of integrated design methods, as e.g. supported by
"grey box" modelling via the HIM scheme.


Straight-on Teaching of Computer Architecture 


As examples for teaching computer architecture
we now have available considerable know-how from
papers on DHLLP design (bibliography: 5), as well
as knowledge and teaching experience on conventional
architectures. The question arises: which type of
example gives us more benefit in understanding
programming language principles and semantics, the
linkage between language levels, computer architecture
and machine organization? The author believes
that for the student the direct implementations of
hierarchies are less opaque than indirect hierarchy
implementations. The introduction to the "tricks" of
indirect implementations would be better scheduled as
a second step, after a successful teaching of direct
implementations, relying on "structured hardware"-type
hypothetical (or real) architecture examples.


IV. CONCLUSIONS 


For the philosophy of straight-on teaching, the HIM
scheme has been proposed as the framework for a
hypothetical direct architecture example, as a "grey box"
model for teaching integrated hardware/software design,
and (with some restrictions for economical reasons)
as a guideline for teaching structured hardware design,
in order to avoid cumbersome tricky hardware and so to
promote the virtues of "ego-less" system design.


REVIEW OF THE WORKSHOP ON
COMPUTER ARCHITECTURE EDUCATION 


George Rossmann 
Palyn, Inc. 


Abstract 


This paper reviews the presentations and discuss- 
ions of the participants in the Workshop on Education 
and Computer Architecture held in Atlanta, Georgia on 
30 August 1973. 


I. INTRODUCTION


A workshop on education and computer architecture 
was held on 30 August 1973 in Atlanta, Georgia. It 
was cosponsored by the ACM Special Interest Group on 
Computer Architecture (SIGARCH) and the IEEE Computer 
Society Technical Committee on Computer Architecture 
(TCCA). 


The goal of the workshop, as stated in the invi- 
tation to participants, was to develop the foundations 
for a series of courses that would provide a good ed- 
ucation in computer systems design. What we had in 
mind was the detailed specification of two or more 
courses in computer architecture which would replace 
both the computer organization course developed by a 
COSINE Task Force (3) and the organization courses (I3
and A2) outlined by the ACM Curriculum Committee on
Computer Science (1). However, the contributions re- 
ceived from the workshop participants addressed such a 
broad perspective of computer systems design that we 
ended up discussing most of the elements of a typical 
computer engineering curriculum (4). The workshop's
observations about current programs in computer engi- 
neering education and its proposals for introducing 
some fresh ideas into these programs are the primary 
reason for circulating this summary of our discussions. 


To ensure a broad perspective, participants were
invited from universities and computer manufacturers. 
They were asked to contribute in whatever way they 
could. We solicited papers for oral presentation as 
well as position papers from those invitees who were 
unable to attend but still wanted to contribute their 
ideas. The result was a workshop attendance of twenty- 
four. There were ten formal presentations given and 
three position papers which were summarized. 


The program was as follows: 


WORKSHOP ON EDUCATION 
AND COMPUTER ARCHITECTURE 


Program 
9:00 a.m. - 12 Noon 


W.J. Watson, "What is This Thing Called Architecture,
and Where is it Going?"

R.A. Dammkoehler, "Experimental Modular Machines"

D. Siewiorek & J. Grason, "Using Register Transfer
Modules (RTM's) in Teaching Computer Architecture"

R. Ashenhurst, "Hierarchical Systems for Laboratory
Automation"

R. Rosin & B. Shriver, "Towards Reasonability in
CPU Design: A Case Study"

T. Rauscher, "The Influence of Specific Problems
on Machine Architecture"


1:00 p.m. - 5:00 p.m.


F. Brooks, "Computer Architecture Education"

S. Fuller, "An Annotated Reading List for a
Topics-Oriented Course on Computer Structures"

C. Hooper, "Study of Computer Architecture
Through Simulation"

G. Lipovski, "A Course in Top-Down Modular Design
of Digital Processors"

H. Hellerman, "New Emphasis in Computer System
Education"

H. Lorin, "Operating Systems Education"

F. Hill, Position Paper on "Digital Systems:
Hardware Organization and Design"


The workshop presentation and discussion can be 
approximately partitioned into three main topics: Dis- 
cussion of the educational methods used to build com- 
puter systems design experience, specification of the 
structure and content of some courses to be included in 
a computer architecture sequence, and finally, descrip- 
tion of some of the elements of computer science educa- 
tion which are considered to be essential in the train- 
ing of a computer architect. 


II Educational Methods


Several strategies for teaching computer systems 
design have emerged: simulation, modular systems de- 
sign, the case study approach, and theoretical studies. 
The appropriateness of each of these depends upon the 
objectives and the level of the instruction. 


Simulation is well accepted as a means for study- 
ing computer structures (3). Its use was discussed by 
C. Hooper. In conjunction with a senior-level course 
on computer architecture, he offers an extensive
software laboratory component in which students are
required to design and program a simulator and cross
assembler for some well defined computer system in
order to study its characteristics and evaluate its
features. This requires an enormous amount of time.
The student increases his programming experience and
finishes the course having examined at least one machine
architecture very closely. This is probably sufficient
accomplishment for an undergraduate course, but it would
not be appropriate for developing the breadth of
understanding needed by a professional computer architect.
It was suggested that his needs might be better served
in other ways. For example, he might be required to
program a few problems on a set of machines to get a 
feeling for the effect of architectural decisions on 
problem processing. Or he might work with a single 
machine and a single problem and determine the effect 
on the program for that problem caused by deleting cer- 
tain features of the machine. 


Modular systems introduce an efficient experiment- 
al dimension to computer systems design. The October 
1973 issue of COMPUTER surveys the state of their de- 
velopment, describes experiences various groups have 
had in using them, and offers some conjectures on why
designers and users are attracted to them. At the 
workshop, Fred Dammkoehler and Dan Siewiorek discussed 
the impact of modular systems in studying computer ar- 
chitecture. 


Dammkoehler argued that in engineering effective 
modeling of real phenomena is the key to understanding 
them. He suggested that, in the context of computer 
engineering, modular systems which lend themselves
naturally to model construction and manipulation as well
as allowing students to maintain contact with reality 
serve the learning process best. The ways in which 
macromodules accomplish this were observed in the con- 
text of a comparative architecture study conducted by 
his graduate students. Experiments were undertaken in 
order to determine the extent to which the efficiency 
of a structural analysis procedure could be improved 
by systematically increasing processor concurrency. 
Three processor structures, serial, asynchronous and 
pipelined, and a hardware monitor were built within a 
reasonably short period of time and the analysis pro- 
cedure was executed on each of the processors. Good 
results were obtained indicating that by using a set of 
appropriately designed modules it is possible in
"reasonable" amounts of time to explore a number of
design alternatives.


Siewiorek described similar experiences with the
use of Register Transfer Modules (RTM's) in studying
computer architecture. A series of laboratory exercises
of increasing complexity (design of a desk calculator,
display processor, simple computer, pipeline processor,
etc.) was used to develop computer systems design
experience. What is remarkable about this laboratory
is that it is used in conjunction with a junior-senior
level Computer Systems course.


Modular systems make it possible to consider rel- 
atively complex digital systems at a functional level 
where they can be handled by the electronically naive, 
and they make the design and implementation of such 
systems quick and understandable. The modular approach 
results in extraordinary student motivation. Labora- 
tory experiences in which students can work in close 
co-operation to build systems which solve real prob- 
lems are becoming much more significant. Simulation 
approaches fail to develop that intense commitment. 


Jack Lipovski presented the description of a 
course for designing digital systems in an MSI and LSI 
hardware environment. Commercial chips are used as
modules. They are incorporated into a design by de- 
ducing from the statement of the problem via a high 
level hardware design language what modules are requir- 
ed and how they should be interconnected. 


The case study approach addresses the upper levels
of computer systems design: computer architecture,
which defines the attributes of a computer system as
seen by the programmer; and physical implementation,
which includes the organizations of processors,
memories, switches, input-output devices, etc. There
was no dispute with the proposition that at the
graduate level the proper study of the machine designer
is machines and that there is no substitute for close
examination of other designers' machines. Fred Brooks,
Sam Fuller, and Joe Watson discussed this strategy.


Professor Brooks' point of view was especially
interesting. He distinguishes computer architects from
computer engineers. In his view, computer engineers 
are responsible for implementation and computer arch- 
itects are responsible for the principles of operation 
manuals. Further, he characterizes training in compu- 
ter architecture as elementary and advanced. The ele- 
mentary level is designed to teach every computer sci- 
entist how machines work and to help him understand 
the forces that led to the design decisions which he 
has seen reflected. Any competent computer science 
instructor can teach it from the literature. The ad- 
vanced level is designed to teach the 5 to 10 profes- 
sional computer architects who are needed by industry 
each year how to come up with a master plan. This 
training, he asserts, can only be offered by
experienced and practicing computer architects. It cannot
be acquired from the literature and there is no
substitute for learning it from someone who has been
through the design experience of a real machine.


III A Basic Course


There are three aspects to Brooks' course on com- 
puter architecture. The first is an in-depth analysis 
of a two-dimensional computer space like that shown in
Figure 1.


The first four topics form the fundamentals of 
computer architecture; although, Tom Rauscher pointed 
out that the design of special purpose machines to 
solve specific problems may make other aspects of com- 
puter systems designs more significant. The computer 
systems are ordered to preserve their evolutionary 
development. The space is scanned in both directions 
in parallel. The horizontal dimension compares dif- 
ferent techniques for solving classical problems and 
charts their evolution. The vertical dimension shows
how solving one of the problems in a particular way 
constrains the solution of another problem to be less 
than best. Studying individual machines in the
absence of such a typical framework, i.e., a pure
case-study approach, frequently produces unsatisfactory
results.


The challenge in analyzing systems is to figure 
out what the designers' reasons for their decisions 
were. The space helps to discover the rationale
behind some of these decisions. The reasons for the
remaining ones may not even be technical and may be
impossible to deduce.


Bob Rosin and Bruce Shriver discussed some of 
these subtleties which accompany design. It is their 
contention that some of the decisions made by computer 
systems designers and implementers are unreasonable. 
They are based on invalid rules of thumb, a narrow and 
incomplete view of the ultimate users' needs, and the 
use of inappropriate tools. Their paper presents a 
case study of a real machine design and shows how it 
was possible to make a somewhat reasonable system out 
of a potentially unreasonable one without sacrificing 
anything other than some traditionally held bad ideas. 


Figure 1 
In light of this, the student's problem is not
only to seek the reasons for decisions but also to
try to assess their ultimate consequences. The teacher
should provide as much real frequency data (e.g.,
opcode distributions, branching conditions, etc.) as
he can locate to support the student's analysis.
Ultimately, the student has to develop an intuitive
sense for the dynamics of machine activity.


The second aspect of Brooks' course is a complex 
software laboratory project based on a real industrial 
or university need. The goal of such a project is not 
to learn how to program, but, just as in the case of 
modular systems, to learn to work as a member of a 
team which must build, debug, document, and demon- 
strate something on a schedule. 


The final aspect involves reading classical pa- 
pers. Sam Fuller offered an annotated reading list in 
which student reaction to many of these papers is sur- 
veyed. Statistics were gathered on responses to five 
questions: Did you read it?, clarity, detail, under- 
standability, and value. Fortunately, many papers are 
available in the text by Bell and Newell (2). 


Some comments suggested that an equally valid ap- 
proach to teaching a course like this could be devel- 
oped from an implementation driven point of view. 


IV Essential Supporting Courses 


An essential part of any computer system is its 
operating system. The architecture and implementation 
of a machine cannot be separated from the operating 
system which runs on it, since together they constitute
the [illegible] which the user sees. A course in
operating systems is an essential co[illegible] to
courses in computer architecture (3). Otherwise, as
Sam Fuller
pointed out, the rationale behind various architectural 
decisions and the reasons for including certain topics; 
e.g., virtual memory, in computer architecture courses 
would be incomprehensible to those students who have no 
operating system experience. 


Hal Lorin addressed the fundamental problem of
operating systems education directly. The problem is
that there is so much material and so many paths
through the material that it is almost impossible to
manage or defend a single unified approach to design
or discussion. The working solution to the problem has
been to fragment the material into separate disciplines, 
to design elegant internal structures without compre- 
hending or investigating their impact on the user, to 
concentrate on a few problems of current interest to 
the exclusion of others, to represent the structure of 
a single system as a natural structure for all systems,
to exclude historical material, and to avoid the full 
implications of systems use. All these prevent the 
computer architect from achieving significant insights 
into the dynamics of the relationship between operating 
systems, computer architecture, and the computational 
environment. The fundamental need in education there- 
fore is to create an appreciation within the computer 
architect for the essential unity with which a user 
sees his system, the various ways in which his machine 
will be used, and how the user judges the success or 
failure of his system. 


This last point was expanded by Herb Hellerman. 
He argued that a consciousness of system performance, 
which he defined as an assessment of the qualities and 
features of a system by as objective a set of measures 
as possible so as to judge its ability to process work 
for some end and to compare it with other systems, 
should permeate the computer science curriculum.
By using objective and quantitative measures of
performance as much as possible and qualitative measur- 
es when real measures do not exist, a student ought to 
be able to develop some sense of the worth of a com- 
puter system. If he combines these with cost data, he 
should be able to evaluate the cost effectiveness of a 
system. Joe Watson also represented this process as 
fundamental from the manufacturer's point of view.


V Conclusions 


Based on what the workshop considered, it might 
have been more appropriately entitled Workshop on Com- 
puter Engineering Education. It clearly demonstrated 
the growth and interdependence of the subjects
recommended for a computer engineering curriculum in (4).
Of particular interest is the emphasis now being placed
on the laboratory experience, both hardware and soft- 
ware, which should accompany these upper-level courses. 


A special task force is now being constituted to
realize the original goal of the workshop: the
development of new course descriptions for computer
architecture courses. The suggestions and contributions made by
the participants in this workshop should make that ef- 
fort simple and fruitful. 
MICROMODULES: MICROPROGRAMMABLE
BUILDING BLOCKS 
FOR HARDWARE DEVELOPMENT 


Richard G. Cooper 
National Security Agency 
Fort Meade, Maryland 


I. Introduction 

The algorithm design phase in the development
of special purpose hardware is
usually a very small part of the overall
effort. A much larger portion - often 90% or
more - is expended on logic design, fabrication,
and debug. Furthermore, since the
pure algorithmic complexity of hardware tends
to be small, algorithm design errors 
typically account for a small part of the 
total debug time; errors due to electrical 
effects consume the lion's share. Precautions 
taken during machine design, fabrication, and 
debug to minimize reflection, switching 
noise, and synchronization errors are time 
consuming and expensive. This situation is 
at its worst when various types of equipment 
are to be produced in small quantities. As 
the use of Schottky-TTL and ECL increases, 
the problem will become more severe. 

The purpose of the Micromodules project
is to greatly reduce the amount of effort
expended on logic design, fabrication and
debug for small quantity developments.
Secondly, with this modular approach, a quick
reaction capability is sought that would
allow a large reduction in the time interval
between system specification and the delivery
of the finished product. Finally, by
simultaneously simplifying and speeding up the
development process, we aim to improve the
practicability of implementing more complex
equipments.

These goals can be achieved by the
development of a family of microprogrammable
modules. Each module will be architecturally
compatible with a small class of common
hardware structures and will conform to a
standardized interconnection discipline. The
system designer will obtain a collection of 
modules from inventory and configure them, by 
means of the interconnection discipline, into 
a system which is architecturally suited to 
solve the problem at hand. 

It is likely that many systems will 
require some special hardware development in 
addition to the standard modules; our 
intention is to minimize the quantity and 
complexity of such special equipment. As the 
project progresses, additional common 
structures will be identified and the family 
of micromodules will be expanded to contain 
them when justified. 

Our approach is not without precedent; 
the Macromodules project [1,2,3] at 
Washington University has been a fundamental 
source of inspiration. There, under the 
direction of W. Clark and C. Molnar, a set of
asynchronous building blocks was constructed.
These can be interconnected with
standard cables. Loading factor allowances,
noise attenuation and techniques for
synchronization were built into each module.
Functionally, their modules are quite simple.
Using adders, registers, memories and other
modules of similar complexity, they can
construct systems of interconnected blocks
which are effectively free from electrical
errors. System implementation can be
accomplished quickly and easily; it is not
uncommon for an engineer to design,
construct, and debug a significant system in
a matter of days.

Due to the functional simplicity of each 
module, the relative cost of eliminating 
intramodular electrical errors is high. 
However, macromodules are intended for the 
construction of experimental equipment. A 
number of modules are configured to implement 
a certain algorithm; the system is used for a 
short period of time and the modules are then 
returned to the stockpile for later use. In
such an environment, the cost of each module
is not very important. It will be used in
many different implementations and only a
fraction of each module's cost need be
attributed to each use. The time and effort
required to build each experimental system is 
the more important consideration. 

Our approach has been to apply the 
macromodular concept to the development of 
unique operational special purpose equipment. 
In this environment, the cost of each module 
is quite important; it will be used in only 
one machine; therefore, the relative cost per 
module of eliminating electrical errors must 
be reduced. To achieve this reduction, we 
chose to increase the functional power of
each module rather than relax the
interconnection discipline. A given algorithm
would be implemented with fewer, more
powerful modules; as a result, the overhead
of eliminating electrical errors is reduced. 


II. Microprogrammed Machines 
Note that a more complex module
increases the danger of sacrificing the
flexibility required for constructing special
purpose hardware of greatly varied designs. 
If flexibility is to be retained, individual 
types of modules should be modifiable within 
the range of their architectures to suit a 
diversity of applications. For this reason, 
our modules are often microprogrammed, i.e.
designed with alterable control memories.
Integrated circuit PROMs (programmable
read-only memories) will be used to specify the
functions to be performed by each module.
When new applications of existing module
architectures are required, new PROMs will be
designed to tailor the modules to the
application. With this approach, we can, in
effect, create a wide variety of complex
building blocks for a minimum of
developmental effort.

Most projects will still require some
ROM design. Although many data routing and
formatting functions will be satisfiable with
basic designs, it is not likely that needed
processing and sequencing functions will have
been previously designed. Therefore,
compared with the macromodular approach, a 
system built with micromodules will require 
more effort - weeks instead of days. 
Nevertheless, the effort involved in system 
implementation will be greatly reduced when 
compared with that of current, traditional 
hardware development methods. 

The total cost of a given implementation 
will probably also be reduced. Cost 
reductions will be achieved in three areas. 
Since effort can be translated into dollars, 
substantial savings will be gained in 
reducing total effort. Because the modules 
will be produced in quantity, the economies 
of scale create further savings. Finally, the 
extensive use of MSI and LSI technology, 
usually unjustifiable in one-of-a-kind 
equipment development, also contributes to 
overall economy. Offsetting these reductions, 
several factors require expenditures not 
normally accruing to equipment development. 
The cost of developing the modules, their 
associated production tooling and inventory 
maintenance costs must be distributed among 
the equipments produced. Any portion of each 
module's capabilities that is not effectively 
used in a given equipment must still be 
purchased. Quantitative comparisons of these 
factors cannot be made at this time, but it 
appears that the overall cost per equipment 
will be reduced. 

In order to clearly describe the 
micromodular approach, we must examine those 
user-microprogrammable machines currently on 
the commercial market. 

The great majority of commercial 
machines are oriented towards emulation; for 
this reason, they tend to be complex and 
expensive. Because the machines will be used 
in stand-alone configurations, the archi- 
tectural emphasis tends toward high speed 
full word arithmetic and logical processing 
overlapped with random access memory fetch. 
Very limited Boolean capabilities and almost 
no multiple Boolean decision and control
functions are included. When used in special- 
purpose equipment, emulation machines require 
considerable interface logic. Relatively
small amounts of local high speed storage are
common, because main memory offers large
amounts of cheaper, slower storage. Due to
the complexity of emulation machines, their
cost prohibits multiprocessing systems for
many hardware applications. Even when multi- 
processing is used, the burden of synchroni- 
zation falls on the microprogrammer, or 
hardware synchronization must be provided. 
Another large segment of the commercial 
market is directed towards the implementation 
of disk and tape controllers. Few of these 
are truly user microprogrammable and 
virtually all are fixed architecture 
machines. Synchronization of multiple machine 
configurations must be microprogrammed or 
implemented by means of additional hardware.
Recognizing the limitations of current 
machines, we decided to use a building block 
approach with the micromodules. Each module 
is designed to solve a small class of common 
hardware problems, without frills. The
emphasis is on low cost, high instruction
cycle rate, and the possibility of cooperation
between modules. Although the class of
problems compatible with each module's
architecture is small, several modules can be
configured to achieve the requirements of a
given implementation.


III. Modular Design Considerations 


The separation of functions is an 
important theme in the design considerations. 
Since microprogramming can be a difficult 
task, the separation of functions is useful 
in dividing the problem into subproblems 
which can more easily be solved. Each
subproblem can then be attacked using the 
most appropriate module. As new classes of 
subproblems are identified, new modules tuned 
to these classes can be developed. System 
debug can also be simplified by the 
subproblem approach; each subsystem can be 
debugged individually, postponing debug at 
the system level until the last subsystems 
are ready.

Since systems will be constructed from 
collections of modules, synchronization and 
buffering are also important considerations.
Facilities for synchronization and buffering 
are built into each module in hardware. 
In most practical cases, loop-free networks 
of modules can be constructed, freeing the 
designer from these problems. Where loops 
must be constructed, some simple precautions 
will ensure that no deadlock problems exist. 
As will be shown later in this paper, a 
minimal amount of programmed synchronization 
can greatly improve efficiency for certain 
kinds of processes.

Connections between modules can be
either arithmetic or Boolean. Arithmetic
paths are eight bits wide (one byte). Each
byte path is constructed by connecting a
polarized ten-conductor cable between a byte
output port on one module and a byte input
port on another. Each port maintains a FULL
flip-flop which specifies whether the port 
contains data. When data is transferred from 
an output port to an input port, the FULL 
flip-flop in the output port is cleared and 
the FULL flip-flop in the input port is set. 
The transfer of data between ports and 
control of the FULL flip-flops during 
transmission are performed completely in 
hardware. Two wires in the ten-conductor
cable are used for handshaking signals. 
Transfer between ports is accomplished by 
logic built into each port. Since each port 
contains its own data buffer register, the 
interconnected modules can be performing 
computations while the transfer is taking 
place. 

Synchronization of byte data transfers 
with processing is accomplished by use of the 
FULL flip-flops. If a microinstruction 
attempts to read data from an input port 
which does not contain data, completion of 
that instruction is suspended until data is 
transferred into the port by the handshaking 
logic. When an input port is read, its FULL 
flip-flop clears, allowing the handshaking 
control to transfer in another byte. Thus 
each access of an input port reads a new byte 
of data regardless of the input arrival rate.
Similarly, a microinstruction that attempts
to place data into an output port which
already contains data is suspended until the
port empties. Since both input and output
ports contain data storage registers, all 
byte transfers between modules are double 
buffered by the hardware. 
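
The FULL flip-flop discipline described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the names `Port`, `try_write`, `handshake`, and `try_read` are hypothetical and do not come from the paper, and a suspended microinstruction is represented simply by a `False` or `None` return value rather than by stalling a clock.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Port:
    """One byte port: a data buffer register plus its FULL flip-flop."""
    data: Optional[int] = None
    full: bool = False


def try_write(out_port: Port, byte: int) -> bool:
    """Microinstruction writes an output port; False models 'suspended'."""
    if out_port.full:
        return False              # port still holds data, so the write waits
    out_port.data, out_port.full = byte & 0xFF, True
    return True


def handshake(out_port: Port, in_port: Port) -> bool:
    """Cable transfer: moves a byte only if the source is FULL and the sink is empty."""
    if not out_port.full or in_port.full:
        return False
    in_port.data, in_port.full = out_port.data, True
    out_port.full = False         # output FULL cleared, input FULL set
    return True


def try_read(in_port: Port) -> Optional[int]:
    """Microinstruction reads an input port; None models 'suspended'."""
    if not in_port.full:
        return None
    in_port.full = False          # reading clears FULL so another byte may arrive
    return in_port.data


def demo() -> list:
    out_p, in_p = Port(), Port()
    log = []
    log.append(try_read(in_p))          # nothing has arrived yet
    log.append(try_write(out_p, 0x41))
    log.append(handshake(out_p, in_p))
    log.append(try_write(out_p, 0x42))  # double buffering: second byte accepted
    log.append(try_write(out_p, 0x43))  # both registers occupied: suspended
    log.append(try_read(in_p))
    log.append(handshake(out_p, in_p))
    log.append(try_read(in_p))
    return log
```

In this model the sender can run one byte ahead of the receiver, which is the double buffering the text refers to; a third write is held off until the cable logic has emptied the output port.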

Each arithmetically oriented module can 
contain multiple input and output ports for 
byte data. Thus data words larger than eight 
bits can be transferred serially by byte or 
in parallel along several cables. Parallel 
transfers occur independently. Since modules 
can be processing while transfers take place, 
and since data is double buffered, the 
duration of data transfer can be several 
instruction times long without much 
degradation of performance. This relatively 
slow data transfer, combined with fixed 
loading factors and reflection characteristics,
reduces electrical interconnection
errors to a low level. Resistor terminators
are built into the input ports and
interconnection cables are shielded.

Boolean interconnections are of two
kinds: level signals and pulsed signals.
Level signals are useful for connections 
between the modules and peripheral equipment. 
Level signals can be used for controlling and 
sensing Boolean lines, e.g. tape and disk 
drives. Pulsed signals are useful for 
synchronization tasks within the network of 
modules. 

Coaxial cables are used for transmitting 
Boolean signals and each module can contain 
one or more Boolean input and output ports. 
Switches are provided on some modules to
specify whether a port will be a level or 
pulsed signal device. 


Level signals are strobed into flip- 
flops at the beginning of each instruction 
cycle to assure unambiguous operation. 
Schmitt triggers are used in some modules to
perform level conversion and signal 
conditioning. 

Pulsed signals require a rise and fall
cycle of operation. A two phase flip-flop
configuration is used on the input lines to 
synchronize pulsed signal transmission. A 
received pulse is stored in a flip-flop until 
the receiving module tests that flip-flop. 
When a pulsed flip-flop is tested, it is 
automatically reset. Pulsed signals are 
therefore not acknowledged in hardware by the 
receiving device. If a given system 
requires acknowledgement, this task must be 
performed in firmware. 
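
The test-and-reset behaviour of a pulsed input, and the kind of firmware acknowledgement mentioned above, can be sketched as follows. The class `PulseLatch` and the function `send_with_ack` are hypothetical names chosen for illustration; the acknowledgement scheme (a second pulsed line running back to the sender) is one possible arrangement assumed here, not one described in the paper.

```python
class PulseLatch:
    """Models a pulsed Boolean input: a flip-flop set by a received pulse."""

    def __init__(self) -> None:
        self._set = False

    def receive_pulse(self) -> None:
        # A pulse is stored until the receiving module tests the flip-flop.
        self._set = True

    def test(self) -> bool:
        # Testing returns the stored state and automatically resets it.
        was_set = self._set
        self._set = False
        return was_set


def send_with_ack(request: PulseLatch, ack: PulseLatch) -> bool:
    """Firmware-level acknowledgement over a separate return line (assumed)."""
    request.receive_pulse()      # sender pulses the request line
    if request.test():           # receiver firmware tests (and clears) the latch
        ack.receive_pulse()      # receiver pulses the return line
    return ack.test()            # sender firmware tests its own latch
```

One consequence of this model is that two pulses arriving before a test are seen as a single event, since a single flip-flop cannot count; a design that must not lose pulses would need an acknowledgement of this kind.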


IV. Design Aids 


The design of ROMs, as has been 
previously stated, is a difficult task. For 
many projects, ROM design will be the most 
time consuming part of system implementation. 
For this reason, numerous ROM design aids are 
planned. Design aids will be written in 
time-sharing Fortran IV for the DEC PDP-10. 

A basic table-driven assembler will be 
constructed. Individual symbolic assemblers 
can then be written for each module by 
providing the basic assembler with the proper 
tables. 
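
A table-driven assembler of the kind proposed can be sketched in a few lines. Everything specific here is an assumption made for illustration: the mnemonics, the opcode values, the 4-bit operand field, and the two example tables are hypothetical and do not describe any actual micromodule. Python is used instead of the Fortran IV planned by the author.

```python
# Hypothetical per-module tables: mnemonic -> opcode.
ADDER_TABLE = {"LOAD": 0x1, "ADD": 0x2, "OUT": 0x3}
SEQUENCER_TABLE = {"WAIT": 0x1, "JUMP": 0x2}


def assemble(source: str, opcodes: dict, operand_bits: int = 4) -> list:
    """Translate symbolic lines into words using only the supplied table."""
    words = []
    for line_no, raw in enumerate(source.splitlines(), start=1):
        line = raw.split(";", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        mnemonic, *operands = line.split()
        if mnemonic not in opcodes:
            raise ValueError(f"line {line_no}: unknown mnemonic {mnemonic!r}")
        operand = int(operands[0], 0) if operands else 0
        if not 0 <= operand < (1 << operand_bits):
            raise ValueError(f"line {line_no}: operand {operand} out of range")
        # Word layout (assumed): opcode in the high bits, operand in the low bits.
        words.append((opcodes[mnemonic] << operand_bits) | operand)
    return words
```

The same `assemble` routine serves both hypothetical modules; only the table passed to it changes, which is the point of the table-driven design.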

A single preprocessor program will be 
used to expand macro routines prior to 
assembly. Alphanumeric text, consisting 
solely of macro control statements, will be 
input to the preprocessor. Expansion will 
then be independent of the individual 
assembly languages; thus the macro capability 
need not be provided for every version of the
assembler. ROM designers must expand macro
calls individually and then edit the expanded 
macro text into the body of the program. 

A functional simulation of each module 
will be provided. The ROM designer can then 
debug his microprogram by repeated cycles of 
editing, assembly and simulation in a manner 
similar to the debugging of software. 

An interconnection simulation routine 
will be used to debug configurations of 
modules. This routine will be an event-table 
simulator which enables the system designer 
to observe the interaction of the modules in 
a system. The degree of overlapped operation 
can be observed and the effects of alterations
to individual modules on the configuration
can be ascertained.
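
As an illustration of what an event-table simulation of interconnected modules might look like, the following sketch simulates a linear chain of modules with fixed processing times. It is a simplified model under stated assumptions: the function `simulate` and its parameters are hypothetical, each module is given an unbounded input queue (the real ports hold only one byte each), and all items are available to the first module at time zero.

```python
import heapq


def simulate(service_times: list, n_items: int):
    """Return (finish time of last item, busy time per module)."""
    n = len(service_times)
    waiting = [0] * n            # items queued at each module's input
    busy = [False] * n
    busy_time = [0] * n
    waiting[0] = n_items
    events = []                  # the event table: (completion time, sequence, module)
    seq = 0
    now = 0

    def start_if_possible(m: int, t: int) -> None:
        nonlocal seq
        if not busy[m] and waiting[m] > 0:
            waiting[m] -= 1
            busy[m] = True
            busy_time[m] += service_times[m]
            heapq.heappush(events, (t + service_times[m], seq, m))
            seq += 1

    start_if_possible(0, 0)
    while events:
        now, _, m = heapq.heappop(events)   # next completion in time order
        busy[m] = False
        if m + 1 < n:
            waiting[m + 1] += 1             # hand the item to the next module
            start_if_possible(m + 1, now)
        start_if_possible(m, now)
    return now, busy_time
```

For two modules taking 2 and 3 time units per item, three items finish at time 11 rather than the 15 units a strictly serial execution would need; the difference is the overlapped operation the designer would want to observe.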

When ROM designs are completed, each 
object program can be dumped to paper tape. 
A ROM can then be physically constructed by 
"burning" the pattern specified on the paper 
tape into PROM integrated circuits. 

System implementation would be accomplished
by the software simulation process of
ROM design, followed by ROM pattern
fabrication. The ROMs would then be plugged into
the appropriate micromodules obtained from 
stock. Standard cables, also obtained from 
stock, would be used to interconnect the 
modules. 


V. Networks of Modules 


An important goal of the Micromodules 
project is to facilitate the construction of 
more complex equipments than are feasible 
with traditional methods of constructing 
hardware. In particular, we wish to encourage 
the use of large networks of modules. Two 
adaptations of well known techniques are 
expected to be of general use in such 
systems: pipelining and parallel processing. 


A. Pipelines 

Pipeline structures are particularly 
appropriate to the micromodules. Let each 
packet of data be represented as x. Let the 
function f(x) be computed by a pipeline of n 
stages, thus 

f(x) = f_1( ... f_{n-1}( f_n(x) ) ... ) 

as illustrated in figure 1. 

In designing a pipeline, each processing 
element should compute the appropriate 
function in a fixed time period. Thus each 
packet spends an identical amount of time in 
each processor. If the subfunctions to be 
computed do not have identical computation 
times, synchronization circuitry must be 
included in the design. If some of the 
subfunctions require random computation 
times, the buffering of data packets must 
also be provided for the sake of efficiency. 
Furthermore, each processor should be 
designed so that its average computation time 
is approximately equal to that of the other 
processors. Because of these optimization 
problems, pipeline structures are not often 
practicable for the implementation of complex 
functions. 

However, a modified version of the 
structure (figure 2) allows the pipeline 
concept to be applied to a larger class of 
practical problems. Let the packet y be 
defined as the augmented pair of elements 

y = (i,x)     (initially, i=n) 

where i is a tag value 
representing the next subfunction to be 
computed, i.e. f_i(x). Let there be m 
processing elements g_j for j=1,...,m. After 
each processing element, there is a buffer q_j 
of fixed size. Let b_j be a Boolean 
feedback signal from q_j to g_j such that 

b_j = 1  iff q_j is more than half full 
      0  otherwise. 

The b_j signal allows the processing 
element to determine the state of its output 
buffer. By considering the tag value i of 
its current packet and the state b_j of the 
output buffer, each processor makes the 
decision to pass the current packet to the 
buffer or to compute the next subfunction. 
Thus 

q_j = g_j(i-1, f_i(x))  iff (b_j=1) and (i>0) 
      (i,x)             otherwise 

for (0<j<m+1). 


Note that b_m=1 regardless of the state 
of q_m. This fact allows each processing 
element to contain the same microprogram. 
Furthermore, the microprogram is not 
dependent on m or n. Thus a pipeline 
executive microprogram can be written and 
debugged for arbitrary m and n values. The 
executive would only be concerned with 
reading and writing data packets, and with 
making the decision to process or pass the 
current packet. System design of a pipeline 
could be performed by combining the executive 
with a list of packet sizes and subfunction 
addresses in a table indexed by i, and the 
microcode for each subfunction. 
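
The process-or-pass decision of the executive can be sketched as follows. This is an illustrative Python model, not the microprogram itself: `subfunctions` stands in for the table indexed by i, `read_b_j` is a hypothetical callback that samples the feedback signal (1 when the output buffer is more than half full), and the last element is modelled by a callback that always returns true.

```python
def executive_step(packet, b_j, subfunctions):
    """One decision for a packet y = (i, x).

    Returns (packet, passed); passed is True when the packet is handed
    to the output buffer instead of being processed further here.
    """
    i, x = packet
    if b_j and i > 0:
        # Output buffer is busy: do more work here by computing f_i.
        return (i - 1, subfunctions[i](x)), False
    return (i, x), True


def run_element(packet, read_b_j, subfunctions):
    """Repeat the decision until the packet is passed on."""
    passed = False
    while not passed:
        packet, passed = executive_step(packet, read_b_j(), subfunctions)
    return packet
```

Because neither function refers to m or n, the same code serves every element, which mirrors the property claimed for the executive microprogram.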

Since the pipeline structure does not 
depend on m, fast failure recovery is facilitated. 
The faulty module can be quickly 
removed from the pipeline and the system can 
be restarted with a structure of size m-1. 
Performance would be degraded, but the 
structure could still operate with up to m-1 
failures. 

It was stated previously in this paper 
that loop-free interconnect structures could 
sometimes be implemented for algorithms which 
contain loops. The simplest method would be 
to contain the loop within a single module by 
means of the microprogram. Loops can also 
be integrated into the pipeline structure by 
a modification of the definition of g_j; let 
the subfunction microcode also compute s_i(i), 
the next value of the tag i. Thus 

q_j = g_j(s_i(i), f_i(x))  iff (b_j=1) and (i>0) 
      (i,x)                otherwise 

with each iteration of a loop being 
considered as a new invocation of the 
subfunction. 

Either definition of g_j preserves the order 
of packet throughput. Although one packet 
can be completely processed in the first 
element and another packet partially 
processed by each stage, each packet will 
leave g_m completely processed and in the 
original order. 
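
The loop variant can be sketched in the same style. In this illustration each subfunction returns both the next tag and the new data value; letting the next tag depend on the data is an assumption made here so that a data-dependent loop can be shown, and `double_until_100` is an invented example subfunction.

```python
def loop_step(packet, b_j, subfunctions):
    """Process-or-pass decision where the microcode also yields the next tag."""
    i, x = packet
    if b_j and i > 0:
        next_i, new_x = subfunctions[i](x)  # plays the role of s_i(i) and f_i(x)
        return (next_i, new_x), False
    return (i, x), True


def double_until_100(x):
    """Hypothetical subfunction 1: re-invokes itself until the value reaches 100."""
    doubled = 2 * x
    return (1 if doubled < 100 else 0), doubled
```

Each pass through the loop body appears as one more invocation of subfunction 1, and the packet leaves with tag 0 once the loop condition fails.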

Because of the importance of the pipeline 
concept, a data flow simulator has been 
programmed. The system designer can specify 
the type and shape of computation time 
distributions (assuming they are independent) 
and the packet size for each value of i. By 
selecting values of m and by combining or 
reducing subfunction definitions, he can 
determine the most effective implementation 
of those considered. 


B. Parallel Processing 


The use of parallel structures like that 
in figure 3 is also anticipated. An input 
controller I is used to schedule the flow of 
input data to each of the m processors g_j, 
while output controller O merges the result 
streams. The mode of operation depends upon 
the characteristics of the function f. If 
f requires a nearly constant amount of 
processing time regardless of the data packet 
values, a phased sequence of processing can 
be scheduled by I and O. If packet transfer 
time is t_q and computation requires time t_f 
for each packet, then for maximum throughput, 

m > [ (t_f + 2t_q) / t_q ]   for t_q > 0 

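
A small worked estimate may help here. The sketch assumes that each packet occupies a processor for one transfer in, the computation, and one transfer out, and that the input controller can start one transfer every transfer time. Rounding the ratio up to a whole number of processors is one reading of the printed bound, whose brackets and relation sign are not fully legible, so the function below is an estimate under that assumption rather than the paper's exact expression.

```python
import math


def processors_for_full_throughput(t_f, t_q):
    """Estimate how many processors keep the input controller busy.

    t_f: computation time per packet; t_q: packet transfer time.
    Assumes a processor is occupied for t_q + t_f + t_q per packet.
    """
    if t_q <= 0:
        raise ValueError("transfer time must be positive")
    return math.ceil((t_f + 2 * t_q) / t_q)
```

For instance, with a computation time of 10 units and a transfer time of 2 units, the ratio is 14/2, so about seven processors would be needed before the controllers become the limit.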
If t_f is random with a significant 
variation, a more complex structure and 
scheduling algorithm might be used. Data 
packets can be buffered as shown in figure 4 
with the scheduling controlled by the state 
of the buffers. For the input controller, 
let b_j be the Boolean state signal defined 
previously. Then a good scheduling 
algorithm might be 

n = min(j) such that b_j=0 

where n, if defined, is the subscript of the 
next buffer q to receive a packet of data. 
Similarly, if a_j is the Boolean state signal 
for the jth output buffer, then let 

k = min(j) such that a_j=1 

else if no a_j=1, 

k = min(j) such that r_j is not empty. 
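
The two selection rules can be written out as a short sketch. Buffers are numbered from 1 as in the text; the list arguments and the use of `None` for "undefined" are conventions chosen for this illustration.

```python
def next_input_buffer(b):
    """Return the lowest j with b_j = 0, or None if every input buffer is busy."""
    for j, busy in enumerate(b, start=1):
        if not busy:
            return j
    return None


def next_output_buffer(a, r_nonempty):
    """Return the lowest j with a_j = 1; failing that, the lowest j whose
    output buffer holds any packet; None if all output buffers are empty."""
    for j, flag in enumerate(a, start=1):
        if flag:
            return j
    for j, has_data in enumerate(r_nonempty, start=1):
        if has_data:
            return j
    return None
```

The output rule serves the fullest-looking buffers first and only then drains partially filled ones, which keeps any processor from stalling on a full output buffer.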


For maximum throughput, 

m > [ E((t_f + 2t_q) / t_q) ] 

Note that the order of packet input is 
not preserved at the output for the case of 
random scheduling. If order must be preserved, 
then an order index can be attached 
to each packet at I. This index can be used 
by O to place the results in order. Let l be 
the maximum number of packets containable by 
the system in figure 4. Let u be the maximum 
possible computation time, and let v be the 
minimum. Then three buffers of size w are 
required for reordering where 

w = [ (l u) / v ] 
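
The use of an order index at O can be sketched as below. The holding area in this illustration is an unbounded dictionary, whereas the text bounds the storage needed by w; indices are assumed to start at 0 and to be assigned consecutively by I.

```python
def reorder(arrivals):
    """Yield results in order-index sequence.

    arrivals: iterable of (index, result) pairs that may be out of order.
    """
    pending = {}
    expected = 0
    for index, result in arrivals:
        pending[index] = result
        # Release every result that is now contiguous with the output so far.
        while expected in pending:
            yield pending.pop(expected)
            expected += 1
```

A result is held only until all earlier indices have arrived, so a slow packet delays its successors but never changes their order.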

For the random scheduler, a data flow 
simulation is planned that will be similar to 
that for the pipeline structure. Several 
executive routines for phased and random 
schedulers, with and without order indexing, 
will be written. 

Figure 1. A Pipeline Structure 

q_j = g_j(i-1, f_i(x))  iff (b_j=1) and (i>0) 
      (i,x)             otherwise 

Figure 2. A Self-Optimized Pipeline Structure 

g_j(x) = f(x)   (0<j<m+1) 

Figure 3. A Parallel Processing Structure 

Figure 4. A Self-Optimized Parallel Processing Structure 


VI. Summary 

The Micromodules project is directed 
towards the simplification of hardware design 
and implementation. A powerful and flexible 
set of microprogrammed modules is provided. 
The use of a standardized interconnection 
discipline, with an emphasis on the elimination 
of electrical errors, allows the 
engineer to concentrate on the architectural 
aspects of his problem. 

The system designer will have two 
powerful structures at hand: the pipeline and 
parallel schedulers. He can design micropro- 
grams for the functions to be computed. 
Using the computation time characteristics of 
the function microprograms, he can simulate a 
data flow model and manipulate the model to 
achieve the desired throughput. Finally, he 
can assemble an arbitrary network of modules 
without bearing the burden of synchronization 
and buffering design. 

A basic family of four micromodules is 
now in the development stage. Future work 
will include the identification and realiza- 
tion of other useful structures, whether 
microprogrammed or hardwired. A continuing 
effort to construct ROM designs with broad 
applicability and to further improve design 
aids is anticipated. Some effort will also 
be made to discover other basic system 
structures which would be useful in 
distributing processing tasks among a 
collection of modules. 
ABSTRACT 


This paper describes the architecture of Computer Modules, 
or CMs. They are large digital modules of about minicomputer 
complexity that are specifically designed to take advantage of the 
rapidly advancing semiconductor technology. These modules are 
intended to be interconnected into systems that implement a wide 
range of computational structures. The main features of a CM 
include a small processor as the primary control element and 
memory distributed among the CMs in the system rather than 
centralized into memory modules as in current multiprocessors. 
CMs are interconnected into a network via buses that each have 
their own virtual address space to facilitate efficient inter-module 
memory sharing. This paper includes an ISP description of the 
address translation mechanisms as well as a discussion of several 
important implementation issues such as the avoidance of 
deadlocks in CM networks and the width of the inter-CM buses. 


1. INTRODUCTION 


This paper describes a set of digital modules that is being 
developed to exploit continuing advances in semiconductor 
technology and to enable the construction of high performance 
computer structures. These large digital modules, called Computer 
Modules or simply CMs, were introduced in two earlier papers 
[3,10]. These papers described some of the fundamental ideas 
that form the basis for our current research. Here we take a 
more detailed look at Computer Modules. Their architecture, as 
well as some important implementation issues, are discussed in 
depth. 


In the past 15 years, standard module sets have evolved from 
circuit elements, to gates and flip-flops, and to register-transfer 
level modules (i.e., MSI packages) [11]. Continuing advances in 
semiconductor technology led us to enlarge the scope of our 
earlier work in register-transfer level modules [4] to include the 
study of "larger" digital modules that can exploit the emerging LSI 
components. Although Computer Modules originated in an effort 
to "scale up" the results of work on register-transfer modules, we 
have also learned much from "scaling down" some of the 
principles of multiprocessor systems currently under development 
[1, 13, 16]. While Computer Modules can be used to implement a 
general purpose computation facility, they are primarily intended 
for special purpose systems. The topology of the network will 
reflect the interprocessor communication requirements of the 
application. 


This research is supported by National Science Foundation Grant 
GJ 32758x. 


The following description of Computer Modules is divided into 
two main sections: their architecture and their implementation. 
The architecture centers around the three types of address 
spaces used in CM networks and the features that enable a CM 
network to achieve a "tighter-coupling" between processors and 
memory than is possible with other multiprocessor or 
multicomputer organizations. Section 3 focuses on the more 
important implementation issues and presents specific solutions 
for CMs based on these considerations. 


2. THE ARCHITECTURE OF COMPUTER MODULES 
2.1 Fundamental Characteristics of Computer Modules 


The primary control element of a Computer Module is a 
programmable processor. This processor may be microprogrammable 
to allow tuning to particular tasks. Microprogrammable processors 
are versatile control elements and are currently used in such 
diverse systems as general purpose processors, I/O processors 
and controllers, device controllers and special purpose language 
processors. The use of small (microprogrammed or otherwise) 
processors as a primitive component is rapidly becoming 
economically feasible. Already a number of small processors exist 
in a single LSI package or a small number of packages [11]. This 
is in contrast with systems built with MSI components which must 
revert to small-scale integration, i.e., gates and flip-flops, to 
implement control primitives. Several attempts have been made 
to develop a control element for register transfer level modules 
[4, 5, 15], but they lack general acceptance. Primarily, this is 
because they have not been produced by any semiconductor 
manufacturer as an MSI component. 


While there certainly exist many applications where a single, 
small processor is sufficient, it is clear that one processor cannot 
provide enough computational power to implement the range of 
high performance computing systems that are needed. If we hope 
to exploit LSI technology, some effective means must be found to 
interconnect a number of small processors into a network. The 
inter-module communications mechanisms must provide for fast 
communication with a maximum potential for concurrency. In Sec. 
2.2 we describe in detail the scheme for interconnecting CMs. 


The overall structure of a Computer Module is shown in Fig. 
2.1*. It consists of a processor, Pc; a local memory, Mp; a number 
of ports, K.maps, which allow interconnection to other CMs; and an 
intra-CM switch, S(processor, bus, and memory) or simply S.pbm, 
which allows the Pc or any port to communicate with the Mp or 
any other port. 


* The PMS notation of Bell and Newell [2] is used throughout this 
discussion to describe the organization of Computer Modules. 


It is useful to contrast the "classical" multiprocessor structure, 
in which an array of processors has access to a large 
homogeneous shared memory via a switching mechanism, with a 
CM network (see Fig. 2.2). The classical multiprocessor memory 
is homogeneous in the sense that memory access time is uniform 
for any word within a processor's address space. All memory 
accesses incur the delay of a single level of switching. In a CM 
network the memory is structured to an arbitrary number of 
levels. Access time to the local memory of a CM by the processor 
incurs effectively no switching delay. A processor accessing 
memory in another CM on a common inter-CM bus incurs two 
levels (two K.maps, see section 2.2) of switching delay. Accessing 
memory via an intermediate CM incurs four levels of switching 
delay, etc. (In a Computer Module network and some other 
multiprocessor systems, a processor need not wait for a write 
operation to be completed before proceeding with the next 
memory reference. Thus, for isolated references or when there is 
little contention for inter-CM buses, write operations to remote 
memory will appear no slower than write operations to local 
memory.) 


The nonhomogeneity of memory access time allows program 
locality to be used to advantage. CMs are primarily intended for 
special purpose applications, where a process can be bound to a 
processor and little or no multiprogramming is necessary. The 
frequently used code and data for a process are placed in the local 
memory of the CM which will execute it [13]. If code and data 
are common to several processors, but infrequently used, only 
one copy of each need exist. Other CMs on the same or adjacent 
inter-CM buses can access common items via the bus network 
when necessary. Thus, references within the primary locality of a 
program will be honored faster than with a classical 
multiprocessor with homogeneous memory. References outside 
the major locality will be honored slower than with a classical 
multiprocessor. In general purpose computer systems, cache 
memory schemes can be used to exploit dynamically detected 
program locality. This is effective for read only code and data, 
but severe difficulties arise in multiprocessor systems if shared 
writable data is stored in a cache, as two or more possibly 
different copies of the same data can be created. 


The switching mechanism that provides fast shared access to 
memory may be a substantial proportion of the total cost of a 
classical multiprocessor system. This is the case with 
Carnegie-Mellon's C.mmp [16] where up to 16 concurrent 
accesses by 16 processors to 16 memory units are possible. 
Sharing N memories among N processors via a crosspoint switch, 
with a potential concurrency of N, requires on the order of N^2 
switching elements. If a distributed switch is used, e.g. as in the 
B.B.N. multiprocessor [13], on the order of N^2 cables are also 
required. Inter-CM memory accesses will be relatively infrequent 
if advantage is taken of program locality. Hence the full 
concurrency provided in a classical multiprocessor is unnecessary. 
Sharing N memories with N processors requires on the order of N 
switching elements when using a single bus. 


It is also useful to contrast CM networks with typical 
computer networks. In a computer network, a processor is tightly 
coupled to its own memory and normally does not have direct 
access to the memory of any other computer in the network. 
Processes, running on different computers, communicate by 
exchanging messages which are routed by some combination of 
hardware and software. Messages are usually long relative to the 
word size of the computers. In the UCI Distributed Computer 
System [9], messages contain approximately 1000 bits. In CM 
networks, communication between processes can occur at the 
single word level and all routing is done by hardware over high 
performance buses. In computer networks, interprocess 
communication usually occurs only at a high level because a 
relatively long message must be assembled and then transmitted 
to a potentially geographically distant computer via relatively 
slow communication lines. 


In summary, in a classical multiprocessor system all 
processors are uniformly and tightly coupled to all memory. 
Processes can communicate on a word level. In a CM network 
processors are very tightly coupled to local memory and more 
loosely coupled to the local memory of other CMs. Processes can 
communicate on a word level with slightly more overhead 
(minimum of two switching levels) than in a multiprocessor. In a 
computer network processors are very tightly coupled to their 
own memory and very loosely coupled to other memory in the 
network. Processes communicate at a message level with 
relatively large delays. 


[Figure 2.1 diagram: Pc, Mp and K.map[0] through K.map[3] joined by 
S.pbm (time-multiplexed crosspoint).] 

Figure 2.1 The Basic Structure of a Computer Module. 


2.2 The Processor, Bus, and Memory Address Spaces 


The inter-CM communication is based on mappings between 
address spaces. The three types of address space in a CM 
network are described in this section. To aid our discussion, Fig. 
2.2 depicts a small, but non-trivial CM network: CM[A], CM[B], 
CM[C] and CM[D] are interconnected via inter-CM buses L and M. 


The most obvious scheme to allow processors to share data 
and procedures in memory is to give them all the same global, 
linear address space. This naive scheme is lacking on a number of 
counts. The linear address space would have to be very large 
(2^24 to 2^32) in order to handle many large, contemporary 
problems. Hence 32 bit addresses would have to be used 
throughout the network and the structure of the CM network 
would have to be coded into the memory access routing 
mechanisms. Instead, Computer Modules use a segmented address 
scheme [7, 14]. The processor sees a virtual address space, the 
processor address space, which is divided into a number of 
variable sized segments (currently there are 16 segments). The 
memory address space is simply the linear, physical address space 
of the local memory. In a single CM system the processor 
address space and the memory address space correspond to the 
virtual and physical address spaces of standard virtual memory 
computer systems. 


Each inter-CM bus has a virtual address space. These bus 
address spaces allow processors to access the memory of other 
Computer Modules. For example, consider a memory reference by 
the processor of CM[A] to the memory of CM[B]. Figure 2.2 
illustrates the mapping between address spaces that must be 
done: K.map[A][0] (K.map[0] of CM[A]) translates an address in 
the processor address space of CM[A] into the bus address space 
of inter-CM bus L. Now K.map[B][1] recognizes the address on 
bus L and translates the address into the memory address space 
of CM[B]. Hence K.map[A][0] and K.map[B][1] have been used 
to map memory requests from the processor of CM[A] to the 
memory of CM[B]. The establishment of this addressability is 
discussed in detail in the following section. 


[Figure 2.2 diagram: CM[A], CM[B], CM[C] and CM[D], each with Mp, 
S.pbm and K.maps, joined by inter-CM buses L and M.] 

Figure 2.2 A system of four Computer Modules. 


2.3 Routing Requests between Address Spaces 


The K.maps of Fig. 2.1 each contain a segment table which 
specifies the mappings to be performed from one address space 
to another. When a K.map recognizes an address on the inter-CM 
bus it performs an address translation. The switch, S.pbm, routes 
translated addresses to either the local memory or to one of the 
three K.maps connected to inter-CM buses. The routing is 
specified for each segment in the segment tables. When a K.map 
receives an address from the switch it requests the inter-CM 
bus. The address is subsequently placed on the bus without 
further translation. 


To return to our example of Fig. 2.2, the processor of CM[A] 
is able to write a word in the Mp of CM[B] if the segment tables 
in K.map[A][0] and K.map[B][1] are set correctly. K.map[A][1] 
performs bus arbitration but does not translate the address. This 
ability of K.maps to route single-word memory access requests, 
independent of the processors, is an important aspect of the CM 
architecture. It ensures a more closely coupled structure than is 
possible with computer networks that transfer data under the 
control of a communication or message processor. 


Figure 2.3 illustrates another important property of CMs, their 
use as a switch between inter-CM buses. In Fig. 2.3 the 
appropriate address spaces are shown for Pc[A] (the processor of 
CM[A] in Fig. 2.2) to access a word in Mp[D] (the memory of 
CM[D]). K.map[A][0] maps the request from Pc[A] into the bus 
address space of inter-CM bus L; K.map[C][2] maps the request 
from inter-CM bus L into the bus address space of inter-CM bus 
M; and finally, K.map[D][1] maps the request into the memory 
address space of Mp[D]. An alternative to this scheme is to have 
a second module type. It is more efficient, however, to take 
advantage of the address translation and bus interface logic 
already provided within a CM than to duplicate it with a special 
purpose switch module. The ability to automatically route 
single-word memory accesses to any memory in a network is 
crucial if networks of CMs need to interact in a closely-coupled 
manner. Transfer of blocks of information is also important in 
many applications. Special hardware is provided so that the 
processor may initiate a block transfer and then continue program 
execution. This allows a higher data transfer rate and more 
productive use of the processor than block transfers by program. 
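
The chain of translations used when a CM acts as a switch can be sketched in a simplified form. Here an address is modelled as a (segment, offset) pair and each segment table as a dictionary; the segment numbers are invented, and the sketch ignores bus arbitration and the bit-level mechanism of section 2.4.

```python
def translate(segment_table, address):
    """Map (segment, offset) through one K.map; None means not recognized."""
    segment, offset = address
    if segment not in segment_table:
        return None
    return (segment_table[segment], offset)


def route(address, kmaps):
    """Apply a sequence of K.map translations, stopping if one fails."""
    for table in kmaps:
        address = translate(table, address)
        if address is None:
            return None
    return address


# Hypothetical tables for the path Pc[A] -> bus L -> bus M -> Mp[D].
kmap_A0 = {5: 9}   # processor segment 5 -> bus L segment 9
kmap_C2 = {9: 2}   # bus L segment 9     -> bus M segment 2
kmap_D1 = {2: 7}   # bus M segment 2     -> Mp[D] segment 7
```

With these tables, `route((5, 0x1F), [kmap_A0, kmap_C2, kmap_D1])` carries the offset unchanged through three translations, while an address in an unmapped segment is simply not recognized.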


Figure 2.4 illustrates additional ways the bus address space 
can be used. In Fig. 2.4(a) several CMs are set up to map the 
same bus segment into their local memories. This arrangement 
gives CM systems a broadcast, or one-to-many mapping ability. 
For example, CM[A] in Fig. 2.4(a), writing into a single location, 
sends information simultaneously to the Mps of CM[B] and CM[C]. 
On the other hand, Fig. 2.4(b) shows how a CM system can 
implement a many-to-one mapping. This arrangement is needed 
whenever several concurrent processes share a common data 
structure. 


[Figure 2.3 diagram: a segment in the address space of Pc[A] is 
mapped by K.map[A][0] into the address space of bus L, by 
K.map[C][2] into the address space of bus M, and by K.map[D][1] 
into the address space of Mp[D].] 

Figure 2.3 Address translation with a CM used as a switch. 

Figure 2.4 One-to-many (a) and many-to-one (b) address mapping. 


2.4 The Segment Table and Address Translation Mechanism 


The mapping logic within the K.map actually performs two 
distinct functions: address recognition and address translation. 
Address recognition and translation information is held in the 
segment table of each K.map. All the access modes described in 
the previous section are achieved by appropriate segment table 
entries. This section describes in detail the address translation 
mechanism. It is also concisely defined, in ISP notation [2], in the 
appendix. 


The segment table consists of a set of segment descriptors, 
where each descriptor defines how one segment in the source 
address space is mapped into a segment in the destination 
address space. A segment descriptor is composed of three fields: 
the source segment name, the destination segment name, and the 
control and status field. One of the subfields in the control field 
is the logarithm, base 2, of the segment size. Hence, only segment 
sizes that are powers of two are allowed: 1, 2, 4, 8, ..., 4K. By 
restricting the segment sizes in this way we avoid the addition 
implicit in more conventional base/limit translation schemes. 


Figure 2.5 shows the essential properties of the mapping 
function performed by the K.map mapping logic. The four most 
significant bits of the source address are used as an index into 
the segment table to select a row, or segment descriptor. The 
segment size field, of the selected descriptor, specifies the 
position of the boundary between the segment name and the 
displacement fields in the source address. The segment name field 
of the source address is compared with the segment name in the 
descriptor. If they match, the destination address is generated by 
concatenating the destination segment name with the displacement 
field of the source address. This mechanism ensures unique 
recognition of segments provided that each segment is allocated 
at a base address that is divisible by its size (which is a power of 
2). 
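
The recognition and translation steps can be sketched as follows. The 16-bit address width is an assumption (four index bits plus a 12-bit maximum displacement would match the 4K maximum segment size, but the width is not stated here), and the field names of `SegmentDescriptor` are illustrative rather than taken from the ISP description in the appendix.

```python
from dataclasses import dataclass

ADDRESS_BITS = 16   # assumption: the address width is not given in this section
INDEX_BITS = 4      # four most significant bits select one of 16 descriptors


@dataclass
class SegmentDescriptor:
    source_name: int      # high-order bits a source address must match
    dest_name: int        # high-order bits of the translated address
    log2_size: int        # segment size is 2 ** log2_size words
    enabled: bool = True  # a disabled descriptor never matches


def translate(segment_table, address):
    """Return the destination address, or None if the address is not recognized."""
    index = address >> (ADDRESS_BITS - INDEX_BITS)
    descriptor = segment_table[index]
    if descriptor is None or not descriptor.enabled:
        return None
    displacement = address & ((1 << descriptor.log2_size) - 1)
    name = address >> descriptor.log2_size
    if name != descriptor.source_name:
        return None
    # Concatenation only: power-of-two sizes make an adder unnecessary.
    return (descriptor.dest_name << descriptor.log2_size) | displacement
```

For a 256-word segment based at 0x3400 and mapped to 0x1200, the descriptor in row 3 would hold source name 0x34, destination name 0x12 and size field 8, so that address 0x34AB translates to 0x12AB while 0x35AB is rejected.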


For a write operation it is reasonable that two or more 
K.maps, on the same inter-CM bus, respond to the same address 
(cf. section 2.3 and Fig. 2.4 (a)). For read operations to the 
same address it is necessary to ensure that only a single K.map 
responds. Thus, apart from conventional protection issues, it is 
necessary to be able to specify that a segment is read protected. 
It is also possible to entirely disable a segment descriptor, i.e. 
prevent it from ever matching. 


All the segment descriptors are in the local memory address 
space. This circumvents the need for the processor to have 
special instructions to maintain the segment descriptors and also 
provides a clean protection mechanism for the descriptors. By 
suitable setting of the segment descriptors they can be made 
available to a remote CM for access via one or more inter-CM 
buses. The local Pc can relinquish its ability to access its own 
segment descriptors. Thus centralized control of inter-CM 
communication is possible and faulty Pcs can be effectively 
removed from a network. 


2.5 Coordination of Control Between CMs 


Inter-CM coordination is a direct extension of the memory 
sharing mechanism described in the previous section. Any 
segment, which maps to the local physical memory address space 
of the CM, may be specified as a control segment. An attempt to 
write into a control segment causes an interrupt to the processor 
and control is transferred to that effective address in the 
processor address space. The contents of memory are not altered. 
However, the data part of the write operation is saved and can be 
used as a parameter to the interrupt routine. It could contain the 
identity of the interrupting CM, the value of a parameter, a 
pointer to a list of parameters, etc. In multiprocessor systems it 
is usually necessary to provide synchronization primitives, e.g., 
the P and V operators of Dijkstra [8]. CMs allow the 
implementation of these synchronization primitives. 


Figure 2.5 Address recognition and translation by a K.map. 
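
The behaviour of a write into a control segment can be modelled with a short sketch. The class name, the representation of control segments as address ranges and the handler callback are all assumptions of this illustration; only the rule that memory is left unaltered while the address and data reach the interrupt routine is taken from the text.

```python
class ControlledMemory:
    """Local memory in which some address ranges act as control segments."""

    def __init__(self, size, control_ranges, interrupt_handler):
        self.words = [0] * size
        self.control_ranges = control_ranges        # (start, end) pairs, end exclusive
        self.interrupt_handler = interrupt_handler  # called as handler(address, data)

    def write(self, address, data):
        for start, end in self.control_ranges:
            if start <= address < end:
                # Memory is not altered; the data word is saved as the
                # parameter of the interrupt routine entered at `address`.
                self.interrupt_handler(address, data)
                return
        self.words[address] = data
```

A remote CM could thus signal this CM by writing, say, its own identity to an address inside a control segment, and the handler would receive both the entry address and that identity.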


2.6 CMs as Modules. 


There are many aspects to the design of digital modules: 
interconnection rules, number of external pins, individual 
performance, etc. Here we are concerned with the extensibility of 
a network of Computer Modules. In this context, extensibility 
applies to several dimensions: address space, total memory, total 
processing power and total data transfer rate. The address space 
of an individual processor is, of course, limited by its internal 
address size. The address space can be expanded by reloading 
the segment registers that provide addressability to the local 
memory of other CMs via the inter-CM buses. Further 
addressability can be achieved by altering the segment registers 
of intermediate CMs used as switches. Thus CMs can be arranged 
in tree structures to give addressability to an arbitrarily large 
memory (with potentially considerable delays). 


Memory can be added to a CM network both by increasing the 
local memory of each CM and by adding extra CMs. Similarly, an 
arbitrary number of CMs can be added to give extra processing 
power. Extra processing power in this form is useful only if the 
task can be partitioned to execute in parallel on the extra 
processors. 


The most direct method of adding CMs to a network is to 
extend an existing bus. There is no limit, at least conceptually, to 
the number of CMs per bus. However, since the maximum data 
transfer rate per bus is fixed, there is a limit to the useful 
number of CMs per bus. It is usually possible to increase the 
overall data transfer rate on a bus (with more than three 
communicating CMs) by grouping the CMs by their frequency of 
interaction, dividing the bus in two, and inserting a CM as a 
switch. 


3. MAJOR IMPLEMENTATION ISSUES 
3.1 Address Translation Logic and the Intra-CM switch. 


It is important to find an efficient and economic 
implementation of the ports and the internal switch of a CM, since 
their complexity may exceed that of the processor. The intra-CM 
switch, S.pbm, could be implemented as a cross-point switch to 
provide maximum concurrency. However, consideration of 
locality indicates that a large majority of the traffic through the 
switch will be from the processor to local memory. Less traffic 
will pass between the Pc and the inter-CM buses and from the 
inter-CM buses to memory. Normally only a small fraction of 
traffic will pass from one inter-CM bus to another. This suggests 
that there would be little performance loss if the switch had a 
concurrency of one. By the same argument, much of the address 
translation logic within each K.map can be centralized into a single 
shared unit. Full centralization of the address mapping logic may 
degrade the performance of the inter-CM buses by increasing the 
effective address recognition/rejection time. Partial 
centralization, however, provides significant hardware savings 
while minimizing the effect on performance. K.map[0] may be 
treated as a special case with accesses to local memory by the 
processor being treated in a simpler, faster manner than accesses 
to remote memory. 


3.2 Deadlock with Inter-CM Memory Access 


In a CM network it is clearly necessary to ensure that 
deadlock does not occur with inter-CM memory access. A set of 
processes is defined to be deadlocked[12] when no process can 
proceed without acquiring a resource already held by another 
process within that set. The necessary conditions for deadlock 
are: resources must not be sharable or pre-emptable, resources 
must be retained while a process is acquiring further resources, 
and there must be a circularity in the resource requirements of 
the processes. 


Referring to the four CM network of Fig. 2.2, consider mutual 
memory accesses between CM[A] and CM[D] where CM[C] is used 
as a switch and may be executing a program independent of the 
communication between CM[A] and CM[D]. An address generated 
by the processor Pc[A], which is intended to reference the local 
memory of CM[D], will be translated by K.map[A][0] and passed to 
K.map[A][1]. When K.map[A][1] has acquired control of 
inter-CM bus L the address will be placed on it and then 
recognized and translated by K.map[C][2]. K.map[C][1] will 
acquire control of inter-CM bus M and the address will be placed 
on it. This address will be recognized and translated (for the 
third time) by K.map[D][1], and is now used to access a word in 
Mp[D]. This method of implementing inter-CM memory access, 
where inter-CM buses are acquired in sequence and relinquished 
in reverse order, we call circuit switched. 


Consider the consequences of concurrent mutual memory 
requests by CM[A] and CM[D] in Fig. 2.2. It is clear that a 
situation may arise where Pc[A] holds bus L and Pc[D] holds bus 
M. Unless one of the processors can be forced to relinquish a bus, 
neither memory access can be completed and the network is 
deadlocked. In this simple network the impasse will be clearly 
evident at CM[C] and one request may be chosen arbitrarily for 
pre-emption, thus resolving the deadlock. Deadlock can occur, 
even with a trivial two CM network, if they are implemented with 
an internal bus which is common to the processor and the local 
memory, e.g. the PDP-11 Unibus[6]. To avoid deadlock it is 
essential that a processor be able to make an external reference 
while its own local memory is being referenced. 
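
The impasse between Pc[A] and Pc[D] is an instance of the circularity condition, and it can be checked mechanically when global information is available. The sketch below is illustrative only: the function `find_deadlock` and its dictionary arguments are hypothetical, and it assumes each requester holds one bus and waits for at most one more.

```python
def find_deadlock(holds, wants):
    """Return the requesters that form a circular wait, or [] if none.

    holds maps a requester to the bus it currently holds.
    wants maps a requester to the bus it needs next.
    """
    owner = {bus: requester for requester, bus in holds.items()}
    for start in wants:
        chain = []
        current = start
        # Follow "waits for the holder of" links until the chain ends
        # or returns to a requester already on it.
        while (current in wants and wants[current] in owner
               and current not in chain):
            chain.append(current)
            current = owner[wants[current]]
        if current in chain:
            return chain[chain.index(current):]
    return []


# The situation described above: each processor holds one bus
# and needs the bus held by the other.
cycle = find_deadlock(holds={"Pc[A]": "L", "Pc[D]": "M"},
                      wants={"Pc[A]": "M", "Pc[D]": "L"})
```

Here `cycle` names both processors. Pre-empting either request breaks the circular wait, which corresponds to the arbitrary choice made at CM[C].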
Figure 3.1 Deadlock, indistinguishable from congestion, with circuit 
switching. 


In networks where there is more than one possible access 
path between any two buses, deadlock may occur without any 
single K.map or CM being able to detect it. Figure 3.1 shows a 
deadlock situation which is manifest at two K.maps in the network. 
Without global information neither of the K.maps can distinguish 
the situation shown from a condition of simple congestion. A 
timeout mechanism could be used after which an incomplete 
access attempt is pre-empted and later retried. This mechanism 
would be very inefficient. There would remain a possibility of 
recurrent deadlocks by conflicting access requests since there is 
no way to avoid the conflicting access requests being pre-empted 
approximately simultaneously. With circuit switching, in an 
arbitrary network, the only way to guarantee freedom from 
deadlock is for each request to carry a unique priority. This 
would ensure that one request is able to complete when an 
impasse occurs. 


If memory access between computer modules can be 
implemented without a request holding more than a single bus at 
any time then deadlock over the allocation of buses can be 
eliminated. This requires that information which defines the 
memory request be buffered at each CM or K.map on the access 
path. For read operations it also requires that the buses which 
comprise the access path be reacquired* to propagate the data 
back to the requesting processor. This type of inter-CM memory 
access implementation we call element switching. An element is 
the information that defines a memory access request (address, 
control signals and usually data). An element is analogous to a 
message in a computer network but is considerably shorter. 


Although element switching eliminates deadlock with respect 
to the allocation of buses it introduces the possibility of deadlock 
over the allocation of element buffers. Provision of one buffer 
per access path through a CM is sufficient to guarantee freedom 
from deadlock. (This property, and other aspects of the deadlock 
situation, are the subject of continuing investigations.) Element 
buffers are allocated by associating a distinct buffer 
(approximately 40 bits) with each segment mapped by a K.map. 
This buffer allocation mechanism is sufficient to enable all possible 
access networks to be implemented without deadlock provided 
sufficient segment descriptors and/or CMs are available. 


* The read element marks the access path on the forward journey. 
This enables the address used in the forward direction to 
determine a unique return path to carry the data referenced to 
the requesting processor. 


Element switching provides better bus utilization and hence 
alleviates inter-CM bus contention. If there is no bus contention, 
element switching will increase the time overhead in making a 
read access to a remote memory over a corresponding circuit 
switched implementation. The extra time overhead is incurred in 
reacquiring the buses to deliver the data to the requesting 
processor. 
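
The trade-off between the two schemes can be made concrete by listing the bus events of a single remote read. This is a minimal sketch under stated assumptions: the event lists, the function names and the two-bus path are illustrative, and buffering delays and contention are not modelled.

```python
def circuit_switched_read(path):
    """Buses are acquired in sequence and relinquished in reverse order."""
    events = [("acquire", bus) for bus in path]
    events += [("release", bus) for bus in reversed(path)]
    return events


def element_switched_read(path):
    """Each bus is released before the next one is acquired."""
    events = []
    for bus in path:                  # request element travels forward
        events += [("acquire", bus), ("release", bus)]
    for bus in reversed(path):        # data element travels back
        events += [("acquire", bus), ("release", bus)]
    return events


def max_buses_held(events):
    """Largest number of buses held at once by the request."""
    held = peak = 0
    for kind, _bus in events:
        held += 1 if kind == "acquire" else -1
        peak = max(peak, held)
    return peak


def acquisitions(events):
    """Number of bus acquisitions made during the access."""
    return sum(1 for kind, _bus in events if kind == "acquire")


path = ["L", "M"]                     # CM[A] to CM[D] through CM[C]
circuit = circuit_switched_read(path)
element = element_switched_read(path)
```

For this path the circuit switched read holds two buses at its peak and makes two acquisitions, while the element switched read never holds more than one bus but makes four acquisitions. The extra acquisitions are the read overhead described above.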


3.3 The Width and Nature of the Inter-CM Bus 


The number of external pins required to interconnect 
Computer Modules will have a significant impact on their cost. A 
related factor is the amount of heat dissipated when driving an 
external line. Power dissipation may be the dominant factor 
limiting packing density for an LSI implementation of CMs. 


While reliability and economic considerations lead to narrow 
buses and few pins, performance considerations clearly imply that 
the inter-CM buses have a high data bandwidth which is 
facilitated by wide buses. Both the overall potential maximum 
data transfer rate and the response time for individual read 
requests (assuming no contention) are important measures of the 
inter-CM bus performance. 


Minicomputer system buses usually have distinct lines for 
each function (address, data and control) and are fully interlocked 
on a word-by-word basis. The absence of time division 
multiplexing of the information carrying lines allows for a high 
data-transfer rate and a minimum of complexity in devices 
interfaced to the bus. Interlocking provides reliable operation 
over a range of bus lengths and loading conditions. Analysis of 
bus handshaking protocols shows that time multiplexing of the 
information lines between address and data imposes considerably 
less than the two to one time overhead expected. Half-width 
buses retain the inherent reliability of full interlocking while 
significantly reducing the number of pins and cables required for 
bus connections. 


Further reduction in the number of lines per bus may be 
achieved by increased time multiplexing of the information lines. 
To maintain data transfer rates comparable with a full or 
half-width bus the full interlocking on each bit must be sacrificed. 
We are investigating schemes which employ a total of 4 to 10 
lines per bus. Address and data information is transferred as 
self-clocking pulse trains down each line. Bus control functions are 
performed using the same lines that carry information. 


4 CONCLUDING COMMENTS 


Parallel Algorithms. Computer Modules are intended to facilitate 
the implementation of parallel algorithms. However, a general 
solution to the problem of decomposing a task into efficient 
parallel processes is not near at hand. Nevertheless, there exist 
parallel algorithms for some important problems and there are 
many applications where the task is presented in a decomposed 
form. For instance, most process control applications consist of a 
number of specialized control tasks with communication at a 
higher level occurring infrequently. 


Project Status. Currently research is proceeding on two fronts. A 
detailed simulation of a CM network is being written. Particular 
emphasis is being placed on the effect of interactions between 
external memory accesses so that the effects of bus and memory 
contention can be accurately assessed. A number of applications 
will be run on the simulation with a wide range of CM 
configurations. Concurrent with the development of the simulation 
a small number of CMs will be built based on existing MSI 
components and commercially available processors. These will 
provide performance figures for the simulation and demonstrate 
the technical feasibility of the design. 
Appendix: ISP Description of the K.map Address Translation 


K.map State 


Segment_Descriptor\SD[0:15]<41:0> 
    The 16 descriptors in the segment table define the mapping 
    between the source address space and the four destination 
    address spaces. 

Source_Segment_Name\SSN := SD<41:30> 
    Name of source segment, 4 high order bits of name are 
    given by the position in the table. 

Destination_Segment_Name\DSN := SD<29:14> 
    Name of destination segment. 

Control_and_Status_Field\CSF<13:0> := SD<13:0> 

Log_Segment_Size\LSS<3:0> := CSF<13:10> 
    Size of segment in both source and destination address 
    spaces is 2↑LSS. LSS < 12. 

Mask<11:0> := 2↑LSS - 1 
    The unary encoding of LSS. 

Destination_Address_Space\DAS<1:0> := CSF<9:8> 
    Designates the address space of the result: memory or the 
    three inter-CM bus address spaces. 

Descriptor_Active := CSF<7> 
    Set to allow translation with this segment_descriptor. 

Write_Protect := CSF<6> 
    Block translation of write requests. 

Read_Protect := CSF<5> 
    Block translation of read requests, required when 
    one-to-many address mappings are used. 

Referenced := CSF<4> 
    Set whenever segment_descriptor is used. 

Changed := CSF<3> 
    Set whenever the segment descriptor is used for a write 
    request. 

Control_Segment := CSF<2> 
    Force interrupt on a write attempt to this segment, see text. 

Interrupt_Enable := CSF<1> 
Interrupt_Pending := CSF<0> 

Interrupt_Data_Buffer<15:0> 
    Register to hold parameter (data part of write operation) 
    of an inter-module interrupt request. Only a single buffer 
    is necessary because subsequent interrupt requests queue 
    in the element buffers which are not visible to the 
    programmer. 


Address Translation Process 


x<15:0> 
    Address of request in source address space. 
Index<3:0> := x<15:12> 
Field<11:0> := x<11:0> 
y[0:3]<15:0> 
    Result address in one of four destination address spaces. 

Descriptor_Active[Index] ∧ 
    Test that indexed descriptor is active. 
(Field ∧ ¬Mask[Index] = SSN[Index] ∧ ¬Mask[Index]) ∧ 
    Test source address matches source segment name. 
(¬Write_Protect[Index] ∨ {x is read request}) ∧ 
(¬Read_Protect[Index] ∨ {x is write request}) 
    Check protection. 
=> ((¬Control_Segment[Index] ∨ {x is read request}) 
    => y[DAS[Index]] ← (Field ∧ Mask[Index]) ∨ 
                       (DSN[Index] ∧ ¬Mask[Index])); 
        This is the translated address. 
   ((Control_Segment[Index] ∧ {x is a write request}) => 
        Test for interrupt. 
    (Interrupt_Data_Buffer ← data; 
        Save data from write request. 
     Interrupt_Pending ← 1; next 
     Interrupt_Enable => {interrupt Pc})). 


* This ISP has been simplified for clarity. 
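
The translation process can also be read as ordinary code. The following Python sketch is one possible reading of the ISP description, not the authors' implementation. The class and function names are hypothetical, the mask is assumed to be LSS low-order ones (the unary encoding), the Referenced and Changed bits are omitted, and a write to a control segment is reported as an "interrupt" result rather than modelled in detail.

```python
from dataclasses import dataclass


@dataclass
class SegmentDescriptor:
    ssn: int                      # source segment name, SD<41:30>
    dsn: int                      # destination segment name, SD<29:14>
    lss: int                      # log2 of the segment size
    das: int                      # destination address space, 0..3
    active: bool = True
    write_protect: bool = False
    read_protect: bool = False
    control_segment: bool = False


def translate(table, x, is_write):
    """Map a 16-bit source address x through a 16-entry segment table.

    Returns ("access", das, y) for a translated address,
    ("interrupt", None, None) for a write to a control segment,
    or None when the K.map does not recognize the request.
    """
    index = (x >> 12) & 0xF       # Index<3:0> := x<15:12>
    field = x & 0xFFF             # Field<11:0> := x<11:0>
    sd = table[index]
    if sd is None or not sd.active:
        return None
    mask = (1 << sd.lss) - 1      # assumed: LSS low-order ones
    high = 0xFFF & ~mask          # bits that must match the segment name
    if (field & high) != (sd.ssn & high):
        return None
    if is_write and sd.write_protect:
        return None
    if not is_write and sd.read_protect:
        return None
    if sd.control_segment and is_write:
        return ("interrupt", None, None)
    y = (field & mask) | (sd.dsn & 0xFFFF & ~mask)
    return ("access", sd.das, y)


# Hypothetical descriptor: a 256-word segment mapped to address space 1.
table = [None] * 16
table[2] = SegmentDescriptor(ssn=0xA00, dsn=0x3400, lss=8, das=1)
result = translate(table, 0x2A37, is_write=False)
```

In this example the low eight bits of the source address (0x37) pass through unchanged and the high bits are replaced by the destination segment name, giving 0x3437 in address space 1.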


Block Transfer Mechanism State 


Source_Address<15:0> 
    Start address of source block. 

Destination_Address<15:0> 
    Start address of destination block. 

Block_Transfer_Control\BTC 

Enable_Transfer := BTC<15> 
Interrupt_Pending := BTC<14> 
Interrupt_When_Complete := BTC<13> 
Error := BTC<12> 
Word_Count<11:0> := BTC<11:0> 


The block transfer mechanism operates within the Pc address 
space. 
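
The field layout of the block transfer control word can be illustrated by unpacking a 16-bit value. The function `decode_btc` and the sample value are hypothetical; only the bit positions are taken from the description above.

```python
def decode_btc(btc):
    """Split a 16-bit block transfer control word into its fields."""
    return {
        "enable_transfer": bool((btc >> 15) & 1),          # BTC<15>
        "interrupt_pending": bool((btc >> 14) & 1),        # BTC<14>
        "interrupt_when_complete": bool((btc >> 13) & 1),  # BTC<13>
        "error": bool((btc >> 12) & 1),                    # BTC<12>
        "word_count": btc & 0xFFF,                         # BTC<11:0>
    }


# Sample value: transfer enabled, interrupt on completion, 16 words.
fields = decode_btc(0xA010)
```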


A MICROPROGRAMMED ARCHITECTURE 
FOR FRONT END PROCESSING 


Rodnay Zaks 
Universite de Technologie de Compiegne, France 


INTRODUCTION 


The increasing diversity of hardware devices and software 
procedures developed for remote processing has yielded a 
multiplicity of new facilities and telecommunication network 
structures. The corresponding architectures for front-end 
systems range from specialized device control to sophisti- 
cated multi-purpose multi-terminal support. Simultaneously, 
the very complexity of new data transmission and processing 
techniques has created a need for flexible and powerful yet 
transparent communications processors. The functions of a 
front-end system will be analyzed in order to derive the 
concepts which will be used to establish a classification. 


In the first part, telecommunications and control functions are 
analyzed in detail, as well as the user facilities at the 
functional level. From these concepts, two types of practical 
classifications are evolved. A global classification 
characterizes the front-end system from the operating system's 
standpoint. A local classification characterizes it as a local 
device, as a function of its service capabilities to the user. 
This dual system allows a simple classification of a given 
device and hence a simpler comparison with others in its 
class. 


The level of support provided by major commercial operating 
systems with respect to front-end systems is then examined. 
Their facilities and shortcomings provide the basis for 
front-end systems. 


Finally, the architecture of a commercial microprogrammed 
front-end processor is presented. A modular microprogrammed 
architecture of this type allows efficient and economical 
system structuring for a wide range of teleprocessing 
services. 


FRONT-END SYSTEMS FOR TELEPROCESSING 


The basic functions of a telesystem 


The telecommunications device, ranging from a simple 
hardwired controller to a large programmable processor, is 
coupled to the host processor's operating system via a 
software communications system. This global system performs 
all the telecommunications functions. It can be called 
the telesystem. According to the distribution of functions 
from the host processor to the front-end device, it becomes 
possible to classify front-ends into logical categories. Such 
a classification makes a cost-performance analysis then easily 
feasible. In order to establish this classification, the 
functions of a telesystem are now considered. 


The minimum functions are: 


1. transmission initiation, control, and completion. 
2. data assembly into required structures: from bits to words, 
   blocks, packets, or messages. 
3. code conversion according to the device, and/or host 
   processor code. 
4. error checking and recovery, possible multiple transmission, 
   error logging. 
5. recognition of control characters and special markers. 
6. line monitoring. 
7. message routing: to/from device or host processor. 
8. bookkeeping procedures associated with beginning and end 
   of transmission. 
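
The data assembly function listed above, building characters from serial bits, can be sketched in a few lines. The function name, the 8-bit character width and the least-significant-bit-first order are assumptions made for illustration; an actual line discipline fixes these choices.

```python
def assemble_characters(bits, width=8):
    """Group a serial bit stream into characters.

    Assumption: the first bit received is the least significant.
    A trailing group shorter than width is left unassembled.
    """
    characters = []
    for start in range(0, len(bits) - width + 1, width):
        value = 0
        for position, bit in enumerate(bits[start:start + width]):
            value |= (bit & 1) << position
        characters.append(value)
    return characters


# The bits of the 8-bit code 0x41, least significant bit first,
# followed by three bits of an incomplete character.
received = [1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0]
assembled = assemble_characters(received)
```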


More complex functions which may reside at the remote 
station are: 


1. line discipline and control: procedures for dial, polling 
   or loop systems. 
2. queuing: in a multiple-device or multiprogramming 
   environment. 
3. dynamic buffer allocation. 
4. message editing and compaction. 
5. local network traffic control: routing to appropriate device 
   or process. 
6. communication line concentration or multiplex. 
7. priority scheduling. 
8. logging. 


In addition, the following specialized functions may be 
provided: 


1. fail-soft: automatic recovery from transmission or 
   system errors or failures 
2. on-line diagnostics 
3. on-line operator dialogue 
4. specialized data compaction 
5. input validation 
6. screen regeneration 
7. graphic manipulation 
8. control of specialized units 
9. security and access procedures 
10. stand-alone capabilities. 


[Illustration 1. The steps of data entry: source data, capture, 
recorded data, data or message displayed or stored.] 


Distinguishing front-end configurations 


Front-end systems perform an interface role between the 
user and the host's operating system, or the telecommunication 
procedure. Each of the functions described in the 
preceding section can be implemented in hardware, firmware, 
or software. Further, most of these modules may reside 
either within the host system, or anywhere on the line 
between the operating system (or the I/O port) and the user. 


This results in a variety of architectures since, once logical 
functions are assigned to physical or software modules, 
these modules may in turn appear as an arbitrarily complex 
front-end system. The resulting complexity does not facilitate 
a logical classification of such systems. Since the 
physical distribution of logical modules onto hardware supports 
may vary widely, and still accomplish similar functions, 
it appears practical to classify these systems by the 
level of service provided. 


Two functional classifications will be made, characterizing 
the system either by its appearance to the host processor's 
operating system (global classification) or by its level of 
service to the user (local classification). 


A global classification of front-end systems 


When viewed at an operating system level, the front-end 
system is characterized in its global environment, and its 
capabilities are tied to the host operating system's 
capabilities that it supports or enhances. The notion of operating 
system becomes a global concept which may be embodied in 
one or more processors. It is assumed here that a significant 
portion of the operating system resides in the front-end. 


The four essential modes are: 
1. real-time 
This applies to all cases where the front-end system can 
react in real-time to a modification of its environment. 
It includes in particular remote process control, where 
the host system is viewed generally as a data base. 


2. time-sharing 
Where it becomes uneconomical to have a large number 
of slow devices interfacing directly to an I/O channel of 
the host processor, a front-end system may provide the 
desired interface functions. As in the case of real-time, 
this does not preclude the system from performing special 
functions prior to communicating with the host. Such 
functions may include: editing, formatting, code conversion, 
preprocessing, or even pre-compiling. 


3. data collection 
This includes all cases where the front-end system appears 
to the host's operating system as a collection of I/O 
devices, capturing and managing the data flow. This 
includes in particular remote-batch processing (program 
entry) as well as data entry, inquiry, and update. Data 
may be captured and collected by a variety of devices, 
including special-purpose devices. An essential role of 
the front-end system is then to look like a standard host 
I/O device. 


4. packet switching 
In this role, the front-end performs the automatic routing 
of blocks of information, whether messages or packets, 
between software or hardware modules. This implies 
elaborate scheduling facilities within the front-end, with 
queuing, dynamic buffering, and corresponding facilities 
within the next processor's telecommunication module. 


A local classification of front-end systems 


The front-end system is characterized as a function of the 
specific capabilities it offers to the user, as a local device 
or service. 


The three main types are: 


1. Emulator 
The emulator is a plug-to-plug compatible device replacing 
an existing manufacturer's controller. Typical 
applications are IBM 2700 or 3700 series emulators. 
These may be hard-wired, micro-coded, or software-coded 
devices. Hard-wired devices usually offer cost 
savings advantages, while firm- or soft-coded implementations 
allow more flexibility. This flexibility may be 
used for the same device to implement several emulators 
(communication with different host processors). It can 
also be used to offer extra services, in addition to the 
strict emulation capability. As the range of extra services 
increases, a second type can be characterized: 


2. Intelligent Emulator 
With diminishing hardware costs, and an ever increasing 
demand for varied user services, emulator devices tend 
to increase in sophistication and offer a new range of 
services, previously not available on the device they 
replace. An intelligent emulator may offer extended user 
or operating system dialogue facilities, fail-soft capabilities 
(message accumulation in case of host processor 
failures, warning to the users, orderly recovery procedures), 
extended terminal support, local message routing 
between terminals (listing cards on the local printer, and 
onwards to more elaborate message switching), limited 
data validation. A general rule is to consider as an 
intelligent emulator a front-end device whose basic function 
is emulation, and where standard functions have been 
extended or improved, or where service facilities have 
been added. As more general facilities or functions 
become added, a third type must be introduced: 


3. General front-end processor 
Such a system can be characterized by the fact that it 
still appears as a single device to the host's operating 
system, yet performs a range of services functionally 
distinct from emulation. Such a system may present advantages 
at two levels: at the user's level, by offering 
services not previously available with the host manufacturer's 
equipment; at the system's level, by incorporating 
many or most of the functions previously handled 
by the host processor and/or specialized equipment or 
modules. 


Typically, the front-end processor will incorporate a 
resident real-time operating system and handle most 
telecommunications functions: line polling, device dependencies, 
queuing, buffer management, code conversion, 
error recovery, multiple transmission. "Intelligent" 
front-ends provide in addition special software capabilities 
for specialized data capture and validation, field 
checking, applications packages, store-and-forward message 
switching. The linkage between the front-end 
processor and the host's operating system may then 
simply consist of a front-end control program, with 
nearly all telecommunication functions delegated forward. 
Some front-end processors even provide a stand-alone 
capability with high-level languages (FORTRAN, COBOL) 
available for local execution. This may be an attractive 
solution as a back-up, in case of host malfunction, or a 
valuable local service (night utilization). 


The very flexibility and power of a general front-end 
processor implies the need for operating software facilities. 
At a minimum, host-resident facilities should include 
the following functions: 


(a) front-end cross-assembler 

(b) load module, for loading the front-end system 
from its host 

(c) transfer module, for dumping core or transferring 
information from the front-end's storage onto a host 
device 

(d) network configurator (macroprocessor) 

(e) file system facilities for front-end library programs. 


Operating system support 


The various degrees of support afforded by major operating 
systems are examined here. The global classification of 
front-end systems has already introduced four main types 
of support: real-time, time-sharing, data collection, and 
message switching. Important subtypes are remote job 
processing, inquiry, and transaction support: access to 
large data files for update (write) or query (read). 


Other facilities which may be included in the telecommunications 
module, interfacing to the Operating System, are: 


1. message control. 
It provides routing facilities between the user's program 
and the telecommunications facilities. In basic access, 
messages are simply routed to their destination point, 
without simultaneous multi-access. In queued access, 
the module manages dynamic buffers and schedules 
transmission. 

2. processing module. 
This user program may perform data collection and 
compaction, complex message switching, and on-line 
data file access. It may be equipped with specialized data 
processing packages. 


IBM OS support 


IBM's OS is one of the most widely interfaced operating 
systems and deserves a special analysis. IBM 360 or 
370 systems use one or more 2700 series transmission 
control units. S/370 may also use the 3700 series. 


The 2700 is a hardwired controller with three basic 
capabilities: 


1. character assembly/disassembly 
2. control character identification 
3. line monitoring (time-out inactive terminals). 


All communications control is accomplished in the CPU 
under OS (or, in a limited way, DOS). The communications 
module is BTAM, QTAM or TCAM ("Telecommunications 
Access Method"). 


BTAM is a simple package which provides elementary 
control functions for telecommunications lines. It is invoked 
in the user's program by the following two macros: WRITE, 
to send a message; READ, to receive data. BTAM is an 
interface module between OS and the user program. It does 
not provide any queuing. Any moderately complex application 
then requires other telecommunications control packages 
as additional interfaces. These will usually reside in a 
high-priority partition (time critical I/O control). 


TCAM provides the queuing facilities. It includes a traffic 
scheduler, handles message switching, and can support a 
high degree of multiprogramming. It is invoked with the 
GET and PUT macros. 


For completeness, the channel control primitives are 
outlined. IBM's I/O instructions, labelled CCW (Channel 
Command Words), perform three functions: data transfers, 
device control, branching within the channel program. 
Communications between the CPU and the channel are performed 
as follows: 


1. CPU command to channel (four types): 
(a) start I/O 
(b) test I/O 
(c) halt I/O 
(d) test channel 


2. channel's interrupt to CPU. The main types are: 
(a) I/O 
(b) programmer error 
(c) supervisor call 
(d) external 
(e) machine check 

Returning to IBM control units, the 3700 series is a 
programmable processor which provides 2700 emulation or 
front-end facilities under NCP. In that case, it implements 
part of TCAM: polling, error recovery, terminal dependency, 
code conversion; the following functions remain in the 
host processor: user-OS linkage and message processing. 
Although the 3700 is programmable, it does not provide 
spectacular improvement. It is limited to 370's under OS or 
VS (no DOS), and does not support local peripherals. It must 
also be stressed that 360 OS MFT/MVT remote support is 
limited to remote batch initiation. 


CDC 6000 series SCOPE 3 


INTERCOM 1 provides interactive time-sharing and remote- 
batch processing support. It is limited to two types of ter- 
minals: teletype and CRT display. Elaborate file access 
and protection facilities are provided, as well as inter-user 
communications. In remote batch, commands issued from a 
terminal place a file on a batch queue for processing. 




Honeywell 200 Mod 4 OS 


The communications supervisor supports remote terminals 
like local peripherals. This allows remote-batch entry. In 
addition, query/response programs residing in the user's 
partitions provide interactive terminal communication. 


Burroughs 6500 MCP 


MCP (Master Control Program) provides elaborate communications 
facilities through a data communications processor. 
It supports remote computing, inquiring and time sharing. The 
message control system handles file maintenance and job 
control, and supplies message-switching capabilities and 
inter-user communications. Facilities provided include a 
variable number of remote stations, line monitoring, condition- 
al command processing (detection of exception conditions), 
initiation of object jobs as independent processes, and main- 
tenance of file and remote user security. 


Univac 1108 EXEC 8 


Remote facilities provide concurrent or on-demand batch and 
real-time processing. The executive control language allows 
commands to be specified from a user's remote console in 
conversational mode. It supports paper-tape input and 
generalized inter-process communications. 


XDS SIGMA 5/7 BTM 


BTM (Batch Time Sharing Monitor) provides time-sharing 
access, remote batch initiation, file positioning. Only the 
operator may communicate with on-line users. 


Honeywell GE 600 series GECOS III 


GECOS III (GE Comprehensive Operating Supervisor) provides 
remote-batch and time-sharing. In batch, other terminals may be 
specified for output or messages. Direct communication with a 
processing program allows direct inquiries. 


A FRONT END MICROPROCESSOR 


A simple 16 bit parallel microprogrammed processor 
system, developed for teleprocessing in Europe, is described 
here. In its basic version, it has been configured as 
a multi-procedure intelligent emulator. Its design provides 
an illustration of the structural trade-offs involved in 
obtaining the required flexibility at minimum-cost in a fairly 
well-defined environment. This front-end system usually 
interfaces directly to a host operating system via an I/O 
channel. In its simplest version, it is transparent to the 
operating system, and provides teleprocessing services 
such as remote-batch. Due to the large variety of teleprocessing 
needs, its structure will have to evolve with time. It 
is commercialized under the name "Ordo 16" (Société des 
Ordoprocesseurs). 


A single internal 16-bit bidirectional bus connects all logical 
elements. While limiting the internal transfer speed, and 
reducing the possible overlap of microinstruction phases, 
this allows the simple insertion of specialized hardware 
functions as plug-in modules. A very short microword format 
(12 bits) limits the possible synchronicity of micro-operations, 
but uses a highly encoded vertical code to 
achieve complex arithmetic or logical operations in a single 
microinstruction cycle (250 nsec). 


The bus connects the control memory, the main memory, 
the ALU (Arithmetic Logical Unit), the scratchpad (a set of 
32 or 64 fast registers), and the I/O controller. This internal 
bus is bidirectional, half-duplex. I/O modules communicate 
with the I/O controller through a slower external bus 
(0.666 MHz vs. 4 MHz for the internal bus). In addition, a 
number of special-purpose hardware modules may be inserted 
on the internal bus. Possible modules are: binary 
function generators, BCD arithmetic or conversion operators, 
string operators. This simple and modular structure 
allows the possibility of shifting functions from soft to 
firm or hard (see illustration 2). 


[Illustration 2. The data flow. Labels: data (8-15), internal 
bus, main memory, ALU, core or IC, I/O, special, control, 
registers.] 


The control structure of the machine appears on illustration 3. 
The control bus includes: 
1. 12 control lines emanating from the data register of the 
   control memory unit. They are gated to all the logical 
   modules. They select a module and specify an operation 
   code in 250 nsec. Whenever an operator module requires 
   more than 250 nsec, another one may be accessed or 
   initiated, resulting in parallel execution. 
2. 9 internal timing lines: clock, test for condition, 
   interrupt management. 


The control unit organization appears on illustration 4. 
The control memory is addressable in four-word blocks 
(12 bit words). Its contents define the instruction set for the 
teleprocessing application considered. It includes a standard 
instruction set and specialized primitives, such as 
microprogrammed multiplexer channel-control, interrupt 
management, and communications control functions. 


A real-time monitor performs task scheduling, and buffer 
and queuing management. It also handles user communications 
through a console, and provides inter-terminal 
transfers. The system's flexibility is used to tailor its 
architecture to the application. In particular, identical 
configurations can be interfaced successively to a number 
of host processors (multi-procedure facility). Intelligent or 
plain emulator versions allow the system to be plug-to-plug 
compatible with a large number of commercial hard-wired 
front-ends. It is also being used as concentrator, and for 
the support of special peripherals. Capabilities under 
development are: general-purpose stand-alone facilities, 
multi-device file system, additional terminal support, 
user-microprogramming facilities, improved host-based 
programming aids. 


This type of microprogrammed front-end processor may
implement many operating system functions, and facilities,
of the host processor, as its "intelligence" level increases.
This may occur without any basic structural change,
usually by expanding the software facilities. This achieves
dual cost benefits:

1. An evolutionary front-end on a fixed hardware
structure.

2. Reduced host processor's time consumption, as more of
its functions get shifted to the front-end.


The increase in intelligence of the terminal is particularly
important: by smoothing the man-machine interface, it
increases the human efficiency in using the front-end and
the host processor systems.

Illustration 4. CONTROL UNIT ORGANIZATION


PROSPECTS FOR A MICROPROGRAMMED FRONT-END 
ARCHITECTURE 


The field of front-end processing has been expanding very 
rapidly. It has been shown how the increased complexity of 
a front-end system is handled in firmware, hardware, or 
software modules. Shrinking LSI costs will favor flexible 
micro-programmed systems, such as the one outlined here. 


Although many other criteria will eventually affect the
marketability of such systems, such a modular and
dynamically changeable architecture presents the best prospects
for an efficient and flexible implementation.
ABSTRACT


Binary-based and fixed-length structure computers 
are often inconvenient and wasteful of resources. In 
this paper we present a design for a fully variable- 
length structured minicomputer. Since all parameters 
(instructions and data) are unrestricted in length, 
their boundaries and interpretation are effected by 
special delimiter codes. For practical reasons (dic- 
tated by current technology) the machine utilizes a 
binary-coded decimal number representation. 


I. INTRODUCTION


Present day digital systems show a prevalence of 
binary, fixed-length structures. This is dictated by 
the technological ease of implementation, low cost and 
high reliability. Yet there are a large number of 
applications where the binary base and fixed-length 
organization are inconvenient and often wasteful of 
resources.

Decimal and variable-length data have been imple- 
mented in differing degrees from the IBM 1620 era [1] 
to one of the latest minicomputers, the CIP/2200 [2]. 
However, most of these machines have achieved these 
features in an "added-on" fashion in a structure that 
mainly offers conventional binary, fixed-length opera- 
tions. The recently reported B1700 computer [3] re- 
flects an attempt to get around the difficulties im- 
posed by fixed-length constraints by providing a highly 
flexible, reconfigurable structure, where specific 
lengths may be defined as run time parameters. 

In this paper we propose the design of a relative- 
ly small-scale decimal, variable-length machine whose 
structure evolves solely from those two features. In a
sense, the work reported here can be interpreted as an
elaboration or feasibility study on some conjectures 
made recently by Foster [4] concerning the architecture 
of the average computer of the year 2000. The design 
study described in more detail in the following sec- 
tions in fact supports the feasibility of the basic 
concept even in terms of present day technology. 


II. MACHINE ORGANIZATION 


In order to provide the variable-length characte- 
ristic for data, OP-codes and addresses, it is necess- 
ary to employ some "length delimiters". Thus it is 
apparent that a truly binary machine could not be cons- 
tructed to meet such requirements, since the range of 
available digits (0,1) leaves no spare codes which 
could serve as delimiters. Hence we must turn to a
higher base system, which for practical reasons might 
be binary coded. 

Our choice is the decimal system with binary coded
implementation. This provides us with six codes (other
than 0-9) for use as delimiters. We will call them
(α, β, γ, δ, +, -).

The machine has a random-access memory with the
capacity of 100,000 digits, addressable to the digit
(0-99,999).


NUMBER AND CHARACTER REPRESENTATION 


In order to represent real numbers of the form N
x 10^e, both N and e are expressed as a sign followed by
a 10's complement value. The exponent e is stored
first, followed by the significant digits N, both num-
ber fields occurring low-order digits first.

For example +318.27 x 10^-12 is represented and
stored in memory as -68+72813*, which is equivalent to
31827 x 10^-14. The decimal point is always implied at
the low order end of N. Note that * could be any deli-
miter.
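
The storage rule above can be sketched in a few lines of Python. This is only an illustration, not the authors' hardware algorithm: the function names are hypothetical, and it assumes that the 10's complement of a negative value is taken over exactly as many digits as its magnitude has, which is the reading that reproduces the example -68+72813*.

```python
def encode_field(value: int) -> str:
    """Encode one signed number field: a sign, then the digits low-order first.

    Assumption: a negative value is stored as the 10's complement of its
    magnitude, taken over the number of digits in that magnitude
    (for example -14 -> 86 -> stored as "-68").
    """
    if value >= 0:
        sign = "+"
        digits = str(value)
    else:
        sign = "-"
        magnitude = str(-value)
        complement = 10 ** len(magnitude) - int(magnitude)
        digits = str(complement).zfill(len(magnitude))
    # Low-order digit is stored first, so the digit string is reversed.
    return sign + digits[::-1]


def encode_real(n: int, e: int, delimiter: str = "*") -> str:
    """Encode N x 10^e as the exponent field, then N, then a closing delimiter."""
    return encode_field(e) + encode_field(n) + delimiter


print(encode_real(31827, -14))  # the worked example from the text
```

Because the low-order digits come first, a serial arithmetic unit can start adding as soon as the first digits arrive, without knowing the total length of the operand.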

Integer form is used for addressing purposes only
(e = 0 and it is not shown explicitly) and it is recog-
nized as such from the context of instructions. We will
refer to {±} followed by a string of digits as a number
field or address, depending on the context, and use the
name number or real number to refer to two successive
number fields.

It is important to observe that the low-order di- 
gits are stored first, because the arithmetic unit must 
be at least partly serial, to enable it to handle arbit- 
rarily long numbers. 

The ASCII character set is represented directly 
using 2 4-bit digits per character. Any two digit deli- 
miter not in the ASCII set, referenced in this paper as 
(0, is used as the character string delimiter. In gene- 
ral, the address of a character string, number, address, 
or instruction is the address of the leading delimiter,
since this delimiter usually describes some property of
the information to follow. 


INSTRUCTION SET 


The machine has thirty-one instructions, including
four rudimentary I/O instructions. Instructions are
delimited by α. An unsigned decimal opcode follows the
leading α and its end is indicated by any single digit
delimiter. All instructions except HALT have an ope-
rand list which in some cases is preceded by a parame-
ter K. The delimiter δ indicates that K is present,
and the possible values for K are integers equal to or
greater than 0. The K value may be referenced by any
of the addressing modes of Table 1. The interpretation
of the delimiter set when used to separate items of the
operand list is shown in Table 1. Table 2 gives the
complete instruction set. In instructions where the
parameter K is called for, it may be omitted if K = 1,
there being no ambiguity, since A can never be imme-
diate data. The number of operands is variable in ADSB
and MUDI instructions.

A few examples should clarify the appearance in
memory of complete instructions, and give an idea of
instruction execution.

(i) αMVNδ2+419δ-97+32α
This instruction inserts the real number 23 x 10^-21,
represented by 2 number fields, into the memory
starting at the address 914; while αMVNδ4+419-0001α
moves the four number fields (2 real numbers or
4 addresses) starting at the address stored at
location 1000 (indirect mode) into 914. If the
number field (address) stored at 1000 has a "-"
leading delimiter, then another level of
indirection is indicated.

(ii) αCPC+001δD1D2D3D4...Dn(0α
compares the character string C1C2...Cn/2 with
the one stored at 100 and sets the 2-bit condi-
tion vector in the CPU to 00 if they are equal,
and to 11 if they are not; while αCPC+001γ456+27α


TABLE 1
Interpretation of delimiters in instructions

Delimiter  Addressing Mode     Operand Form
           or Function

+          direct              unsigned integer address.
-          indirect            unsigned integer address.
γ          indexed             unsigned integer address (the
                               location of the index), delimited
                               by + or - indicating a direct or
                               indirect address to follow.
δ          immediate           number, address or character
                               string, appropriately delimited.
α          instruction
           delimiter
β          operation change
           (indicates SUBTRACT instead of ADD and DIVIDE
           instead of MULTIPLY in the ADSB and MUDI
           instructions.)


TABLE 2
Instruction Set

DESCRIPTION

Move number fields (up to Kth delimiter) from B to A
Move character strings (up to Kth delimiter) from B to A
Move K digits from B to A
Compare addresses at A and B
Compare numbers at A and B
Compare character strings at A and B
Compare digit list at A and B
Logical "AND" of K digits at B and C into A
Logical "OR" of K digits at B and C into A
Logical "complement" of K digits at B into A
Truncate number at A to K significant digits
Clear K digits starting at A
Move pointer at A over Kth number field delimiter
Move pointer at A to the end of the Kth number field
Move pointer at A over to Kth character string delimiter
Move pointer at A to the end of the Kth character string
Add/Subtract B,C,... and put in A
Mult./Div. B,C,..., and put in A
Add/Subtract addresses (single number field)
9's complement of the number field starting at B into A
Br zero (conditional branch to A)
Br nonzero (conditional branch to A)
Br negative (conditional branch to A)
Br positive (conditional branch to A)
Unconditional branch to A
Subroutine linkage to A
Stop
I/O w.r.t. device K; transfer 16 bits of data
I/O w.r.t. device K; transfer 8 bits of data
compares the character string at [[654]+72]
where [...] indicates "the contents of".

(iii) αMVPNδ21+2α takes [2] as an address and, assu-
ming [2] points at a number field delimiter, in-
creases the value of [2] until it points at the
12th number field delimiter from the starting
point. This would allow [2] to now point at the
6th number down the list from the starting num-
ber. This provides the means for accessing
variable length number or character strings in a
list of such items where the programmer knows
explicitly only the address of the first item.

In the above examples, we have used appropriate 
mnemonics for the OP-codes, but they are actually spe- 
cified in memory by unsigned integer codes. 


III. HARDWARE DESIGN 


Since all parameters may be variable in length, a 
fully parallel design of the machine cannot be 
achieved. It is apparent that serial by digit struc- 
ture would be the simplest solution in terms of hard- 
ware costs. However, in order to attain a reasonable 
processing speed, some degree of parallelism must be
introduced.

In our prototype design we have chosen serial 
processing of four-digit (16 bits) blocks of data. 
Figure 1 shows the block diagram of the machine. 

Memory has a 16 bit word length and its address- 
ing is arranged in a 4 x 25 x 1000 digit pattern, 
giving a total capacity of 100,000 digits. It is 
digit-addressable, necessitating two internal read 
cycles if the address is not 0 or divisible by 4. In 
order to avoid alignment difficulties on the data bus 
and in the 4-digit parallel arithmetic unit, the memo- 
ry includes alignment circuits so that the memory data 
register always contains the addressed digit plus the 
three digits that follow. Thus it is not necessary to 
impose any boundary alignment conditions on the prog- 
rammer for storage of data in the memory. 
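
A minimal Python sketch of this alignment behaviour follows. It is not the authors' circuit: the function name and the model of memory as a list of 4-character strings are assumptions made for illustration, and the end-of-memory case is ignored.

```python
WORD_DIGITS = 4  # one 16-bit memory word holds four BCD digits


def read_four_digits(words, address):
    """Return (digits, cycles) for the four digits starting at a digit address.

    `words` is a hypothetical model of memory: a list of 4-character strings,
    one per 16-bit word. An aligned address needs one read cycle; any other
    address straddles two words and needs two.
    """
    word_index, offset = divmod(address, WORD_DIGITS)
    if offset == 0:
        return words[word_index], 1
    first = words[word_index]
    second = words[word_index + 1]
    # Alignment step: tail of the first word followed by head of the second.
    return first[offset:] + second[:offset], 2


memory = ["0123", "4567", "89AB"]
print(read_four_digits(memory, 4))  # aligned read
print(read_four_digits(memory, 6))  # read straddling two words
```

The caller always receives the addressed digit first, so neither the data bus nor the programmer has to know where the word boundaries fall.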

Internal sequencing and serial control of instruc-
tion execution when the operand length exceeds 4 digits
is regulated by the pointer registers P1, P2, P3 and
P4, each being a 5-digit counter-register.

All addressing is carried out via a 5-digit
address bus. Since addresses are obtained directly
from instructions they are not necessarily correctly
aligned on the data bus. This is remedied by
assembling all addresses in the address register which
includes the required alignment switches.

FIGURE 1
Block Diagram

Peripheral devices PD1,..., PDN are addressed 
through the low order digits of the address bus, with 
data transfer handled by the data bus. 


IV. PROGRAMMING CONSIDERATIONS 


Consistent with the theme of Foster's [4] brief 
sketch of the average computer of the year 2000 which
"... will be a monoprocessor doing its own I/O, - most
probably be privately owned and monoprogrammed, - be an
interpretive engine capable of executing directly one 
or more high-level languages...", it is claimed that 
although we do not interpretively execute several high- 
level languages, the instruction set of Table 2 makes 
possible efficient processing on a "one-shot" basis of 
relatively small user programs. This is a reasonable 
goal for a small, general, privately owned and mono- 
programmed computer in any event. The efficient pro- 
cessing we mentioned above is from the programmer 
standpoint. This means that the machine language, re- 
presented in some assembly form, should have instruct- 
ions and formats that make coding of normal problems 
somehow natural and concise. 


MATRIX MULTIPLY ROUTINE 


We first present a complete program to multiply 
two matrices of real numbers. All matrix entries are 
of variable length, so normal indexing would not work 
on any machine, and the equivalent program in a fixed 
word length structure would be somewhat clumsy and un- 
natural. 

The program performs the computation
C = A x B where A is ID rows by JD columns,
B is JD rows by KD columns,
and C is ID rows by KD columns.
Matrices are stored in column order and the program
variables for the matrix dimensions are the same as
above.
Assuming that the matrices A and B have been 
loaded in core and ID, JD, and KD have been appropriate- 
ly initialized, the program is: 


        ADSBA  II←ID+ID          ;increment step for PT1
        ADSBA  K←-KD              to access successive
        MVN    JV←0               row entries.
        MVN    PT3←#C            ;load address of C into
KLOOP:  MVN    M←0                PT3.
        ADSBA  I←-ID
ILOOP:  MVN    PT2←#B
        MVPN   JV,PT2            ;sets PT2 to appropriate
        MVN    PT1←#A             column of B.
        MVPN   M,PT1             ;sets PT1 to appropriate
        ADSBA  J←-JD              row of A.
        MVN    2,'PT3←0-0        ;clear c(i,k) (initial length
                                  unimportant.)
JLOOP:  MUDI   TEMP←'PT1*'PT2    ;a(i,j) x b(j,k)
        ADSB   'PT3←'PT3+TEMP    ;accumulate into c(i,k)
        MVPN   2,PT2             ;move PT2 across 2 delimi-
                                  ters to b(j+1,k).
        MVPN   II,PT1            ;move PT1 to a(i,j+1)
        ADSBA  J←J+1
        BRN    JLOOP
        MVPEN  2,PT3             ;move PT3 to next C entry
        ADSBA  M←M+2
        ADSBA  I←I+1
        BRN    ILOOP
        ADSBA  JV←JV+JD+JD       ;sets B column accessing
        ADSBA  K←K+1              variable.
        BRN    KLOOP
        HALT


In this program, and in the remainder of this sec-
tion, we have used a suitable assembler notation for
the parameter and operand lists for instructions. For
example, #C refers to the address of C, and 'PT3 indi-
cates indirect addressing through location PT3. There
are no macro references; and there is a strict one-to-
one correspondence between the lines in the program and
machine instructions.
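
The pointer stepping used by the routine can be followed more easily in a high-level sketch. The Python below is an illustration only, not a simulation of the machine: it assumes each matrix is a flat list of number fields in column order, two fields (exponent, then significant digits) per real number, so that a list index stands in for "number of delimiters passed". The function names and the use of Fraction for exact arithmetic are choices made here, not part of the paper.

```python
from fractions import Fraction


def real_at(fields, ptr):
    """Read the real number whose exponent field sits at index ptr."""
    exponent, significand = fields[ptr], fields[ptr + 1]
    return Fraction(significand) * Fraction(10) ** exponent


def matmul_fields(a, b, id_, jd, kd):
    """Multiply A (id_ x jd) by B (jd x kd), both stored in column order."""
    ii = id_ + id_            # fields to skip to reach the next entry in a row of A
    jv = 0                    # field offset of the current column of B
    c = []                    # results, produced in column order
    k = -kd
    while k < 0:              # KLOOP: counters run from -dimension up to 0
        m = 0                 # field offset of the current row of A
        i = -id_
        while i < 0:          # ILOOP
            pt2 = jv          # PT2: top of the current column of B
            pt1 = m           # PT1: first entry of the current row of A
            acc = Fraction(0)
            j = -jd
            while j < 0:      # JLOOP
                acc += real_at(a, pt1) * real_at(b, pt2)
                pt2 += 2      # next entry down the column of B
                pt1 += ii     # next entry along the row of A
                j += 1
            c.append(acc)
            m += 2
            i += 1
        jv += jd + jd
        k += 1
    return c


# A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]], column order, exponent 0.
a = [0, 1, 0, 3, 0, 2, 0, 4]
b = [0, 5, 0, 7, 0, 6, 0, 8]
print(matmul_fields(a, b, 2, 2, 2))
```

On the machine itself the entries have arbitrary length, which is why the program advances its pointers by counting delimiters (MVPN) rather than by adding a fixed stride to an address.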

The previous example was concerned with arithme- 
tic operations and array accessing. We now illustrate 
some aspects of non-numerical programming. The example 
chosen can be taken as a model of some aspects of sym- 
bol table manipulation in a language processor. The 
main idea here is to illustrate the ease of building 
and searching tables of variable length mixed data
types.


SYMBOL TABLE MANIPULATION 


A particular type of character string made up of 
ASCII symbols sd te a is to be processed. 


<A> <D> 

A syntactically correct string must start with a
member of <A> and end with a member of <D> followed by
$, with no other occurrences of $, and with all other
occurrences of members of <D> isolated by members of
<A>.

There are also some semantic rules that must be
met. First, we need some definitions. Each occurrence
of a member of <D> will be said to "terminate" the pre-
vious contiguous substring of members of <A>, and the
class name <LABEL> will be used to describe any such
contiguous substring of members of <A>. Now, a syntac-
tically correct string is also semantically correct if
all <LABEL>'s terminated by ":" are unique and any
<LABEL> terminated by ";" also appears in the string
terminated by ":".

Examples of correct and incorrect strings are:

(i) START:LOOP:CTR:LOOP;OUT:$ is both syntactic-
ally and semantically correct.

(ii) A:A;SRCH:;COMP:SRCH:A:$ is both syntactically
and semantically incorrect (see the underlined places).
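
The two sets of rules can be checked mechanically. The following Python sketch is an illustration under stated assumptions: it takes <A> to be letters and digits and <D> to be the two characters ":" and ";", which is consistent with the examples but is not spelled out in the surviving text. The function name is hypothetical.

```python
import re


def check_label_string(s):
    """Return (syntactically_correct, semantically_correct) for a string.

    Assumptions: <A> is letters and digits, <D> is ":" or ";".
    Semantic correctness is only defined for syntactically correct strings,
    so a syntax failure is reported as (False, False).
    """
    # One or more <LABEL><D> pairs, then a single closing "$".
    if re.fullmatch(r"(?:[A-Za-z0-9]+[:;])+\$", s) is None:
        return False, False
    pairs = re.findall(r"([A-Za-z0-9]+)([:;])", s)
    defined = [label for label, term in pairs if term == ":"]
    referenced = [label for label, term in pairs if term == ";"]
    unique = len(defined) == len(set(defined))
    resolved = all(label in defined for label in referenced)
    return True, unique and resolved


print(check_label_string("START:LOOP:CTR:LOOP;OUT:$"))
print(check_label_string("A:A;SRCH:;COMP:SRCH:A:$"))
```

The first string passes both checks. The second fails the syntax check because ":" and ";" appear next to each other, and it would also fail the uniqueness rule since A and SRCH are each terminated by ":" twice.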

The processing to be performed on these strings
is as follows: Build a table in core of all unique
<LABEL>'s with an address associated with each. If the
<LABEL> first occurs terminated by ":", the address is
provided from the contents of a word addressed as
LOCCTR; otherwise, the 5-digit value 00000 is associa-
ted with the <LABEL>. This "dummy" address will be re-
placed by the correct value from LOCCTR when the
<LABEL> later occurs terminated by ":".

There are two subroutines used in the program
which we will list in detail. One, called TBLSCH, is
used to search the table for the occurrence of the
<LABEL> currently in 'BUFF. The locations TOP1 and
BOT1 are pointers to the top and bottom of the table,
respectively; and PT1 is a pointer location for access-
ing the table entries. On exit, put "Y" in ANS if the
<LABEL> is found, and leave PT1 pointing at the lead
delimiter for the matching <LABEL>; otherwise, put "N"
in ANS. The routine is accessed from a JMPS instruct-
ion which puts the return address in the first location,
TBLSCH, in the routine.

The coding is: 


TBLSCH: DA     5                 ;assembler command to
        MVN    PT1←BOT1           establish a 5-digit "return"
        CPA    TOP1↔BOT1          field.
        BNZ    CHECK             ;go to CHECK if table non-
        MVD    2,ANS←"N"          empty.
        BR     'TBLSCH
CHECK:  CPC    'BUFF↔'PT1        ;compare LABEL in 'BUFF
        BZ     FOUND              with one in table.
        MVPC   PT1               ;move PT1 to start of next
        ADSBA  PT1←PT1+2          <LABEL> in table
        MVPN   PT1
        ADSBA  PT1←PT1+1
        CPA    PT1↔TOP1          ;has whole table been
        BNZ    CHECK              searched?
        MVD    2,ANS←"N"
        BR     'TBLSCH
FOUND:  MVD    2,ANS←"Y"
        BR     'TBLSCH
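
For readers who prefer a high-level view, the control flow of TBLSCH can be paraphrased in Python. This is a sketch only: the table is modelled as a flat list that alternates labels and addresses, so stepping the pointer by 2 stands in for the MVPC/MVPN delimiter skipping, and the function name is hypothetical.

```python
def table_search(table, label):
    """Search a flat table [label0, addr0, label1, addr1, ...] for a label.

    Returns ("Y", index_of_label) if found, otherwise ("N", None),
    mirroring the ANS flag and the PT1 pointer left by TBLSCH.
    """
    pt1 = 0                  # BOT1: bottom of the table
    top1 = len(table)        # TOP1: first free position above the table
    while pt1 != top1:       # an empty table falls straight through
        if table[pt1] == label:
            return "Y", pt1
        pt1 += 2             # step over this label and its address
    return "N", None


table = ["START", "00010", "LOOP", "00020"]
print(table_search(table, "LOOP"))
print(table_search(table, "OUT"))
```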


The second routine, called TBLINS, inserts the 
<LABEL> in 'BUFF onto the top of the symbol table and 
associates the address in PARAM with it. The pointer
TOP1 is adjusted appropriately.


TBLINS: DA     5
        MVN    PT1←TOP1
        MVC    'TOP1←'BUFF       ;add <LABEL>
        MVPEC  PT1
        MVD    2,'PT1←"[])"      ;insert character delimi-
        ADSBA  PT1←PT1+2          ter
        MVN    'PT1←PARAM        ;insert associated address
        MVPEN  PT1
        MVD    'PT1←'+'          ;insert number field deli-
        ADSBA  PT1←PT1+1          miter
        MVD    2,'PT1←"({)"      ;insert table-top delimi-
                                  ter
        MVN    TOP1←PT1          ;adjust table-top pointer
        BR     'TBLINS
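
The way the two routines cooperate with the dummy-address rule described earlier can be sketched as follows. This Python is an illustration, not the authors' complete 110-instruction program: the flat-list table, the function names, and the choice to patch only entries that still hold the dummy value are assumptions.

```python
DUMMY = "00000"  # placeholder address for a label not yet terminated by ":"


def find(table, label):
    """Return the index of label in [label0, addr0, ...], or None."""
    for index in range(0, len(table), 2):
        if table[index] == label:
            return index
    return None


def insert(table, label, param):
    """Append the label and its associated address at the top of the table."""
    table.extend([label, param])


def process(table, label, terminator, locctr):
    """Record one <LABEL> occurrence, given its terminator and LOCCTR value."""
    index = find(table, label)
    if index is None:
        insert(table, label, locctr if terminator == ":" else DUMMY)
    elif terminator == ":" and table[index + 1] == DUMMY:
        table[index + 1] = locctr   # replace the dummy with the real address


table = []
process(table, "LOOP", ";", "00120")   # first seen as a reference
process(table, "LOOP", ":", "00200")   # later defined
print(table)
```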


Although we have only presented two of the sub- 
routines used in the complete program, the type of 
coding used at the assembler level for non-numeric 
processing should be evident. The complete program
required 110 instructions, including the subroutine 
coding. 

Due to the radically different structure, it is 
difficult to compare our machine with standard minicom-
puters. Meaningful comparisons will become possible
only as a result of extensive experience with it. The 
machine was simulated and some interesting observations 
made. For example, the above matrix multiply routine 
was found to require 400 digits of storage with the 
delimiter density of 25%. 


V. CONCLUSIONS 


We have described the design of a fully variable- 
length general purpose computer. In order to assess 
the feasibility of such machines it is essential to 
take a close look at advantages gained and difficulties 
that might be encountered. 

Based on a number of programs that we have written, 
it is apparent that programming presents fewer diffi- 
culties than one usually encounters with standard mini- 
computers. 

Limits on computational accuracy, size of operand 
labels and data as well as the alignment requirements, 
are non-existent from the programmer's point of view 
by the very nature of the machine. 

In order to determine the physical feasibility of 
such machines, we have completed the design on the 
basic circuit level (using standard TTL MSI components). 
As a result we have found that the hardware complexity 
and cost place the machine in the price range of 
typical minicomputers. Simulator runs have been used 
to verify the logical correctness and adequacy of the 
selected instruction set, as well as to obtain an 
evaluation of memory and cycle time requirements.
ABSTRACT


Many problems, inherent in air traffic control, 
weather analysis and prediction, nuclear reaction, 
missile tracking, and hydrodynamics have common 
processing characteristics that can most efficiently 
be solved using parallel "non-conventional" tech-
niques. Because of high sensor data rates, these
parallel problem solving techniques cannot be eco- 
nomically applied using the standard sequential 
computer. 


The application of special processing techniques
such as parallel/associative processing is still
resisted because it is a change from the norm. Past
implementations utilized special hardware, custom 
circuits and complex designs. The Honeywell Asso- 
ciative Parallel Processing Ensemble (HAPPE) was 
built to demonstrate the basic simplicity of hardware 
concepts inherent in parallel associative processing. 
HAPPE is implemented with the same standard cir- 
cuit building blocks (MSI's, ROM's, and RAM's) that 
are used in all conventional computers. By using 
standard building blocks, a long-time objection to
associative memory systems, that of requiring
special purpose (low usage) circuits, is overcome.


The parallel/associative processing element can be 
both powerful and versatile. The HAPPE architec- 
ture proves that one processor element can perform 
both correlation (associative) and arithmetic pro- 
cessing. The HAPPE demonstrator has become a 
valuable tool in training system designers and pro-
grammers to recognize that new "thinking" can be
applied to advanced processing system 
implementations. 


Introduction 


Problems associated with systems such as radar 
tracking and discrimination, air traffic control, 
weather prediction, nuclear reactor control and 
hydrodynamic prediction have common character- 
istics. Each of these problems exhibits a high 
degree of parallelism; that is, many sets of data 
must be evaluated, manipulated and reduced by the 
same computing process as rapidly as possible and 
preferably simultaneously. 


Those systems requiring a real time problem solu- 
tion should take advantage of this parallelism by
solving each problem set in a simultaneous manner.
The architectural solutions to these problems have 
led to considerable discussions on special process- 
ing techniques particularly on pipeline processing 
versus parallel associative processing. Numerous
papers have been written on the subject. It has been 
shown that with the technology available today (2), 
many of the problems discussed earlier can be 
solved by the pipeline processor in a more econom- 
ical way. These problems, however, require that 
the input sensor data always occur in the same 
order. As the input data becomes more unordered 
or random, the parallel associative processor be- 
comes the only candidate available for real time 
processing (1), (4). 


The Honeywell Associative Parallel Processing 
Ensemble, HAPPE, was built with standard building 
blocks to demonstrate the hardware concepts and in- 
herent simplicity and computing power of parallel 
associative processing and the processing element. 
Its functional organization is designed to accommo- 
date both I/O associative processing and simulated 
track processing. 


During track processing, the input space built into the 
hardware is divided into two random sets. Thus if an 
input sensor or target has provided an input in one of 
the two sets, for associative processing and then 
switches to the other set, the track processing asso- 
ciated with the first set of data will be delayed a cycle. 
However, we still have no prior knowledge of when 
the data arrives at the processing system. The 
HAPPE demonstrator proves that one processor ele- 
ment (hardware entity) can perform both correlation 
(association) and arithmetic processing. 


The HAPPE processor is implemented with the same 
standard building blocks (MSI's, ROM's and RAM's) 
that are used in more conventional computers. This 
overcomes a long-time objection to associative mem- 
ory systems that require special purpose building 
blocks, such as complex one-of-a-kind large scale 
integrated circuits. 


Background 


The evolution of the parallel associative processor is 
shown functionally in Figure 1. Reference (3) de- 
scribes the historical evolution of associative 
processors. 


PARALLEL PROCESSOR (CONTROL UNIT, PE's)

• MANY PE's
• EACH CONSISTS OF ARITHMETIC UNIT AND
  RANDOM ACCESS DATA MEMORY
• CAN PERFORM PROCESSING OPERATIONS
  SIMULTANEOUSLY ON MANY DATA SETS

ASSOCIATIVE MEMORY (CONTROL UNIT, SEARCH REG,
MASK REG, RESULTS REGISTER)

• LOGIC DISTRIBUTED THROUGHOUT MEMORY
• CAN PERFORM SEARCH (COMPARISON) OPERATIONS
  SIMULTANEOUSLY ON MANY DATA SETS

ASSOCIATIVE PROCESSOR (CONTROL UNIT, RESULTS
REGISTER, ARITHMETIC UNITS)

• MANY PE's
• EACH CONSISTS OF ARITHMETIC UNIT AND ONE
  WORD OF ASSOCIATIVE MEMORY
• CAN PERFORM SEARCH AND PROCESSING OPERATIONS
  SIMULTANEOUSLY ON MANY DATA SETS

PARALLEL ASSOCIATIVE PROCESSOR (CONTROL UNIT, PE's)

• MANY PE's
• EACH CONSISTS OF AN ARITHMETIC UNIT AND A
  COMBINATION OF RANDOM ACCESS AND ASSOCIATIVE
  DATA MEMORY
• CAN PERFORM SEARCH AND PROCESSING OPERATIONS
  SIMULTANEOUSLY ON MANY DATA SETS

Figure 1. Parallel Processor/Associative
Processor Relationships

The parallel processor is characterized by a control
unit and a number of processing elements that per- 
form simultaneous operations. The control unit to 
processing element bus is used primarily in a se- 
quential manner to either carry inputs, commands or 
outputs. 


The Associative Memory is primarily a storage 
facility with logic added to each memory position to
perform associative searches. These searches are 
classified primarily as input searches and output 
searches. 


When performing an input associative operation, an 
input data set is processed by means of the following 
search operations: 


What stored data is less than the input? 
What stored data is greater than the input? 


What stored data is within a delta limit of the
input?


No stored data is within a delta limit of the 
input. 


The input operation is primarily a matching opera- 
tion between data sets stored in the processors and 
randomly occurring input data sets. 


The output associative operation searches the asso- 
ciative memory to determine: 


The stored data with the minimum value. 
The stored data with the maximum value. 
All stored data within a given sector of space. 


All stored data outside a given sector of space. 


The output operation is used to provide the user with 
the exact information he needs to make an opera-
tional or control decision.
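
The input and output searches listed above can be written down compactly. The Python below is a sketch of the search semantics only: an associative memory compares every stored word at once, whereas these comprehensions visit the words one after another. The function names, the use of plain numbers as stored data, and the modelling of a "sector of space" as a one-dimensional interval are all assumptions made for illustration.

```python
def input_search(stored, value, delta):
    """Match one input value against every stored word.

    Returns index lists for the less-than, greater-than and within-delta
    searches, plus a flag for the "no stored data within delta" case.
    """
    within = [i for i, s in enumerate(stored) if abs(s - value) <= delta]
    return {
        "less": [i for i, s in enumerate(stored) if s < value],
        "greater": [i for i, s in enumerate(stored) if s > value],
        "within_delta": within,
        "none_within_delta": not within,
    }


def output_search(stored, low, high):
    """Select stored data by value: extremes and membership in [low, high]."""
    return {
        "minimum": min(stored),
        "maximum": max(stored),
        "inside": [s for s in stored if low <= s <= high],
        "outside": [s for s in stored if not (low <= s <= high)],
    }


stored = [10, 25, 40, 55]
print(input_search(stored, 30, 6))
print(output_search(stored, 20, 50))
```

In a tracking application the "none within delta" result is the interesting one for a new input, since it signals that no stored data set matches it.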


The associative processor is really closer in func- 
tional execution to the associative memory than it is 
to the parallel processor. Primarily then, it is an 
associative memory with minimal processing capa- 
bility associated with each memory word. The 
STARAN is an example of the associative processor.


As we move up the functional ladder from associative 
memory, to associative processor, to parallel asso- 
ciative processor, we are asking the system to per- 
form more and more processing for every associa- 
tive or correlation operation. Thus the parallel 
associative processor is aimed at the application 
requiring a high random sensor input rate, accom- 
panied by a large number of sensor dependent com- 
putations performed on each input, in real time. The 
HAPPE system was designed to demonstrate the 
functional sophistication afforded by parallel asso- 
ciative processors. 


HAPPE System 


The HAPPE system demonstrates the use of asso- 
ciative parallel processing in solving the complex 
problem of target tracking using a modern phased
array radar as a sensor. Phased array radars pro-
vide a modern tracking system with the ability to 
electronically point and shape one beam, or a multi- 
plicity of beams, and to steer them at electronic 
speeds. Target location coordinates in phased array 
radar systems are specified by range and beam num- 
ber, rather than by range, azimuth, and elevation. 
The beam number incorporates both azimuth and 
elevation of the target. This data gathered at elec- 
tronic speeds, demands close coupling with its pro- 
cessing operation in order to meet the workload and 
real time restrictions of this high performance 
system. 


The HAPPE organizational design consisting of two 
global control sections and a number of redundant pro- 
cessing elements is primarily based on the following 
observations: 


(A) Radar data is constantly coming into the sys- 
tem without interruption. First, radar data 
comes from random beams in the first half 
of the sensor space and then from the second 
half of the sensor space.


(B) Target data is associatively assigned to pro-
cessing elements; then an arithmetic program
(the track processing) is completed using the
correlated data before the next correlation
can take place.


(C) A standard arithmetic building block (the 
"181") performs the arithmetic add, subtract,
and logical operations; as well as the less 
than, greater than, and equal to correlation 
operations. 


(D) The radar tracking algorithms can be per-
formed without built-in multiply or divide
capability.


An analysis of the requirements stated above led to 
the architecture shown in Figure 2. This block dia- 
gram resembles the parallel associative processor of 
Figure 1. But, operationally HAPPE uses a different 
philosophy from any other associative processor. In 
the HAPPE system, the processing element controls 
its own mode and process state. The control units do
nothing but repeatedly broadcast their algorithms or
commands and data, never knowing whether any ele-
ments are reacting to these commands or not.


In comparing HAPPE and PEPE, the HAPPE system 
contains two control units and one processing element 
unit, while the PEPE system contains three control 
units and three processing element units. In HAPPE, 
the control units alternately control the element unit; 
but in PEPE, each of the control units controls only 
its associated element unit. 


The processing elements operate in two modes-- the 
correlation mode and the arithmetic mode-- and with- 
in each mode two states-- active and inactive. Of 
course the system must contain a master reset which 
sets every element to the correlation mode and active 
state. 


When an element is in the correlation mode and active 
state, it performs all commands and accepts all data 
from the correlation control unit. 


Figure 2. System Block Diagram (blocks: Arithmetic
Control Unit; Correlation Control Unit; Elements No. 1
to No. 3, each with data, local control, and processor)


Any element in the correlation mode and inactive
state can perform the following commands from the
correlation control unit:


Activate All
Master Reset
Select Next Inactive.


When an element is in the arithmetic mode and active
state, it performs all commands and accepts all data
from the Arithmetic Control Unit. Any element in
the arithmetic mode and inactive state can perform
the following commands from the Arithmetic Control
Unit:


Activate All
Master Reset.


The correlation control unit provides the following
commands:


Deactivate on Not Equal to
Deactivate on Less Than
Deactivate on Greater Than
Load Local Memory to A Register
Load Local Memory to B Register
Store Global Data to Local Memory
Activate All
Switch Modes
Master Reset
Select Next Active
Select Next Inactive.


The Arithmetic Control Unit provides the following
commands:


Load Local Memory to A Register
Load Local Memory to B Register
Store A Register to Local Memory
Add
Subtract
Switch Modes
Deactivate on Not Equal to
Shift Left
Output
Master Reset
Activate All.


A block diagram of the processing element is shown
in Figure 3. The radar processing problem requires
that a correlation process assign targets to
processing elements and then process the data associated
with each target. The processing element resembles
any minicomputer containing two registers, an
arithmetic unit, and a scratchpad memory. The element
differs by containing a local activity and mode control
which can be modified by the comparison outputs of
the arithmetic unit.


Figure 3. Processing Element Block Diagram


Each processing element operates like a correlation
unit when in the correlation mode (accepting
commands from the correlation control unit) and like an
arithmetic unit when in the arithmetic mode
(accepting commands from the arithmetic control unit).


Each HAPPE processing element performs add,
subtract, and shift operations, as well as equal to, less
than, and greater than operations. During the
arithmetic mode, the track update programs are
performed and during the correlation mode, the target
data assignment programs are performed.
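The mode and activity rules described in this section can be sketched as a small model. This is an illustrative sketch only, not the HAPPE logic design: the class name `Element`, the method names, and the rule that an element ignores the control unit that does not match its current mode are assumptions drawn from the wording of the text. The command names and the two "inactive" command sets are taken from the lists given in this section.

```python
from dataclasses import dataclass

CORRELATION, ARITHMETIC = "correlation", "arithmetic"

# Commands that an inactive element still obeys, per mode (from the text).
INACTIVE_OK = {
    CORRELATION: {"Activate All", "Master Reset", "Select Next Inactive"},
    ARITHMETIC: {"Activate All", "Master Reset"},
}


@dataclass
class Element:
    """Hypothetical model of one processing element's local control."""
    mode: str = CORRELATION   # state after a master reset
    active: bool = True

    def accepts(self, source: str, command: str) -> bool:
        """True if the element reacts to `command` broadcast by unit `source`."""
        if source != self.mode:
            # Assumption: an element listens only to the unit of its mode.
            return False
        return self.active or command in INACTIVE_OK[self.mode]

    def apply(self, source: str, command: str) -> None:
        """Update local mode and activity; other commands are not modeled."""
        if not self.accepts(source, command):
            return
        if command == "Master Reset":
            self.mode, self.active = CORRELATION, True
        elif command == "Activate All":
            self.active = True
        elif command == "Switch Modes":
            self.mode = ARITHMETIC if self.mode == CORRELATION else CORRELATION


element = Element(active=False)
print(element.accepts(CORRELATION, "Switch Modes"))   # False: inactive
element.apply(CORRELATION, "Activate All")
element.apply(CORRELATION, "Switch Modes")
print(element.mode)                                   # arithmetic
```

The control units in this sketch never learn whether an element reacted, which mirrors the broadcast philosophy described above: all decisions are local to the element.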


Demonstrator Description 


The HAPPE demonstrator under discussion and 
illustrated in Figure 4, consists of three 4-bit pro- 
cessing elements. The demonstrator operates with 
two beam numbers (a realistic system would have 
about 250) and two incoming targets that have an 
arbitrary decreasing range assignment of 15 to 0 
units. A realistic system would look at up to 1000 
targets coming in from about 150 KM. 
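As a quick check of the between-limit counts quoted for the demonstrator, the following sketch counts how often a target whose range steps from 15 down to 0 falls inside a range gate. It assumes one range sample per unit step; the function name `between_limit_hits` is hypothetical, and the gate values (10 to 13, and 2 to 8) are the ones quoted with Figures 6 and 7.

```python
def between_limit_hits(ranges, lower, upper):
    """Count the range samples that satisfy lower <= range <= upper."""
    return sum(1 for r in ranges if lower <= r <= upper)


incoming_target = range(15, -1, -1)   # range decreases from 15 to 0 units

print(between_limit_hits(incoming_target, 10, 13))  # 4
print(between_limit_hits(incoming_target, 2, 8))    # 7
```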


Figure 4. HAPPE Demonstrator


The problem solved by the demonstrator consists of
recognizing the beam number of the target,
correlating the range of the target with a range gate in each
processing element, and then performing an
arithmetic subroutine on the correlated data. A pictorial
of the simulated system is shown in Figure 5 and an
example of the radar return for an incoming target is
shown in Figure 6. An example of a range gate
computation for one target can be seen in Figure 6, with
the R (minimum) set at 10 and the R (maximum) set
at 13. A target moving through this range gate will
cause four between-limit matches.


Figure 5. Physical Model


Figure 6. Target Movement (time axis marked 8 through 15)


Figure 7 describes the control panel, the initialization
of the demonstrator, and the results of the target
tracking program. For example, processing
element 1:


A. During the set up of the demonstrator, the
beam number (beam 1) is stored in scratchpad
location (X1, Y1), the lower gate (Range
2) in (X1, Y2), and the upper gate (Range 8)
in (X1, Y3).


B. During the operation of the simulation, the
processing element only reacts to the target
while it is in the range of 8 to 2.


C. At the end of the operation, the element has
seen the target seven times.


CONTROL PANEL
(Switches Up are On)


Global Data Lines (G1, G2, G3, G4) - Provide binary coded
data to all elements which are active and in the
correlation mode.


X Data Lines (X1, X2, X3, X4) - Select the X data
location of the element memories.


Y Data Lines (Y1, Y2, Y3, Y4) - Select the Y data location
of the element memories.


D.O. - Allows the wired "OR" outputs of the element
"A" registers (which are active) to be displayed
on the panel lights (A1, A2, A3, A4).


Auto - Sets the control units in the auto clock or
single step program mode.


Reset - Resets the counters of the control units to
restart the program.


Clock - Allows the control units to be single stepped
through their programs when the auto switch
is off.


CR - Sets all elements to the correlation mode.


SW - Switches the mode (correlation or arithmetic)
of all the elements that are active.


ACT - Sets all elements to active.


Z1 - Clocks the data selected by global lines into
the memory location selected by the X, Y
switches (for all active elements in the
correlation mode).


Z2 - Clocks the data from the memory location
selected by the X, Y switches into the A register
(for all active elements in the correlation mode).


Z3 - Clocks the activity flipflop when using the select
commands.


SFA - Activates the pointer for selecting the first active
element in the correlation mode.


SFA - Activates the pointer for selecting the first inactive
element in the correlation mode.


SET UP OF DEMONSTRATOR 
PUSH CR BUTTON 

PUSH AS BUTTON 

WRITE (Z,) 0 IN X3, Y,

WRITE (Z,) 0 IN X3, Y,

WRITE (Z,) 2 (G,) IN X,, Y,

WRITE (Z,) 5 (G,, G,) IN Xo. Y5
WRITE (Z,) 4 (G,) IN X,, Y3
WRITE (Z,) 0 IN X., Yq

PUSH SFA AND Z, 

WRITE (Z,) 1 (G,) IN X4, Yy 

WRITE (Z,) 2 (G5) IN X,, Yo 

WRITE (Z,) 8 (G,) IN X4, Y3 

PUSH SW BUTTON 

PUSH SFA AND Z, 

WRITE (Z,) 2 (Go) IN X,, Y, 

WRITE (Z,) 6 (Ga, Go) IN X,, Yo 
WRITE (Z,) 13 (Gy, Gz, Gy) IN X,, Y3 
PUSH SFA AND Z, 

WRITE (Z,) 1 (G,) IN X,, Y, 

WRITE (Z,) 4 (G4) IN X,, Yo 

WRITE (Z,) 6 (Ga, Go) IN X4, Ya 
PUSH AS BUTTON 

PUSH CR BUTTON 


SELECT AUTO MODE 


RESULTS 

TARGET TRACKING PROGRAM 
Element Number 1 7 in A-Reg. 
Element Number 2 8 in A-Reg. 
Element Number 3 3 in A-Reg. 


ARITHMETIC PROGRAM 
(All Elements (X., Y4) three) 
Push As 
Select Xo, Y 4 
Push Z, 
Read A-Register 
TO RESTART PROGRAM 
Push CR Button 
Push AS Button 
Select X Y, 
Push Z, 
Push Z. | 
Clear Xo, Y, 


3’ 


Push Reset 


Programs 


The sample programs executed by HAPPE are 
typical of real-life problems and are described in 
the following: 


(A) In the correlation program, one of two
beam numbers is compared with the beam
number data stored in the elements. A
between-limits search on the range data
provided by the simulator is then made.
During the limit search, each of the three
processing elements will have made use
of its own range gate. At the conclusion
of the correlation processing mode, those
elements that have just performed
correlations are switched to the arithmetic mode
and those that have completed the arithmetic
processing are switched to the correlation
mode.


(B) The arithmetic program updates and
accumulates the number of between-limit hits in
each of the elements. This program also
demonstrates typical arithmetic operations
which are required to calculate a new
range gate.


(C) The select highest and output program
selects the one element which has the
highest number of range gate hits and
outputs this number to the control unit.


The page limitations of this paper have prevented the
inclusion of the program flow diagrams. These
can be obtained by contacting the author.
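Since the flow diagrams are not reproduced, the three programs can be approximated by the sketch below. It is a sequential stand-in under stated assumptions, not the demonstrator's programs: the names (`Element`, `correlate`, `accumulate`, `select_highest`) are hypothetical, the two control units are run one after the other instead of simultaneously, and the beam and gate values (beam 1 with gate 2 to 8, beam 2 with gate 6 to 13, beam 1 with gate 4 to 6) are read from the Figure 7 set-up listing, whose subscripts are damaged, so their assignment to elements 1, 2, and 3 is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Element:
    beam: int            # beam number stored in the scratchpad
    lower: int           # lower range gate
    upper: int           # upper range gate
    hits: int = 0        # between-limit hit count
    mode: str = "correlation"


def correlate(elements, beam, rng):
    """Correlation program: beam compare, then between-limits search."""
    for e in elements:
        matched = e.beam == beam and e.lower <= rng <= e.upper
        if e.mode == "correlation" and matched:
            e.mode = "arithmetic"      # switch modes after a correlation


def accumulate(elements):
    """Arithmetic program: count the hit, then return to correlation."""
    for e in elements:
        if e.mode == "arithmetic":
            e.hits += 1
            e.mode = "correlation"


def select_highest(elements):
    """Select highest and output: the largest hit count."""
    return max(e.hits for e in elements)


elements = [Element(1, 2, 8), Element(2, 6, 13), Element(1, 4, 6)]
for rng in range(15, -1, -1):          # both targets close from 15 to 0
    for beam in (1, 2):
        correlate(elements, beam, rng)
        accumulate(elements)

print([e.hits for e in elements])      # [7, 8, 3]
print(select_highest(elements))        # 8
```

Under these assumptions the counts agree with the A-register results listed in Figure 7 (7, 8, and 3).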


Conclusions 


The HAPPE demonstrator shown in Figure 4 was 
built to demonstrate several important concepts of 
parallel/associative processing through use of a 
simulated phased array radar processing system. 
These are:


(A) That one processing element implemented
with standard conventional logic circuits
can perform both correlation and
arithmetic processing.


(B) That two control units operate
simultaneously on different processing elements.


(C) That the operation, initialization, mode
switching, and activity of parallel
processing elements can be easily controlled.


(D) That the signal distribution and busing
required to effectively operate a parallel
associative processor uses standard
circuits.


(E) That the correlation and arithmetic
programming takes advantage of standard
programming methods with the addition of a
small number of parallel control
instructions.


(F) That HAPPE has a built-in "Fail Graceful"
characteristic where one element can fail or
be disengaged from the system and the rest
of the system operate at a reduced target
load.


(G) That a low cost, three element, 4-bit data
associative parallel processor can be built
to demonstrate the operating characteristics
of a large system. This demonstrator is
also an excellent training tool for advanced
architectures and the software required to
make them operate.


Bibliography 


(1) Hobbs, L. C., "Parallel Processor Systems,
Technologies and Applications", Spartan Books,
1970.


(2) Lloyd, G. and Merivin, W., "Analysis of Three
Large Computer Systems", AFIPS National
Computer Conference, June 1973.


(3) Parhami, B., "Associative Memories and
Processors: An Overview and Selected
Bibliography", Proceedings of the IEEE, Vol 61,
No. 6, June 1973.


(4) Thurber, K. J. and Berg, R. O., "Applications
of Associative Processors", Computer Design,
Vol 10, pages 103-110, November 1971.


A COMPUTER ARCHITECTURE AND ITS 
PROGRAMMING LANGUAGE 


Mario R. Schaffner 
Massachusetts Institute of Technology 
Cambridge, Massachusetts 


1. INTRODUCTION

Computer architectures and programing
languages are traditionally developed
independently. Through suitable computer architecture,
for instance, one can attempt to speed up the
processing of a stream of data and
instructions, leaving to the software the burden of
preparing these streams. Through a suitable
programing language, one aims at efficiently
describing many classes of problems, in a
phrase-structure form that is machine
independent. This approach has led to the ever-increasing
application of computers, but it has also
brought about a growing complexity in the
software systems. As a consequence of the latter,
there is a new trend toward extending hardware
implementations to replace the software ones,
especially in view of the new technology of
large scale integration.


In the past, several computer
architectures were suggested toward the attainment of
a larger computing power, such as the Solomon
computer [1], the Holland machine [2], a
spatially oriented computer [3], and a fixed plus
variable structure computer [4]. Then, two
basic techniques appeared of more general
applicability, and led to actual implementations:
parallel processing, and pipeline execution
[5]. In parallel processing, an array of
similar processors work simultaneously on different
data, under the control of the same control
unit. The modularity of such an organization
is attractive in many respects. However, the
performance is heavily dependent on parallelism
in the problems [6], and programing techniques
need to be developed for exploiting the
potential capabilities of the computer and the
inherent parallelism in the computations [5].
Whereas for particular problems parallel
computers can achieve a throughput which is orders
of magnitude larger than that of conventional
computers, for general problems they face a
performance degradation that increases with
the number of processors.


Pipelining consists of the concurrent
execution of the various stages of the
processing by independent units connected in cascade.
This concurrency can be implemented at
different levels [7]; the overlapping of the
processor and memory operations [8], and that of
the steps of arithmetic operations [9] are
examples. The theoretical limits of pipelining
have been analyzed [10]; in practice,
advantages depend upon the presence of a stream of
similar tasks [11]. All modern large computers
have some degree of parallelism and pipelining.
In order to analyze their organization,
instruction and data streams can be defined, and the
management of requests and services considered
[12]. In this context, computer architectures
are, at first, classified as single-instruction
single-data streams, single-instruction
multiple-data streams, multiple-instruction
single-data streams, and multiple-instruction
multiple-data streams.


The use of these computer architectures
depends heavily on complex compilers, or
interpreters. Compilation requires a preliminary
run, generally does not produce optimum code, and
makes the debugging more difficult.
Interpreters require a large memory space and produce
a slow execution. For these reasons the
question arises recurrently whether computers
constructed to directly execute programs written
in the user programing language could lead to
a more efficient overall system [13]. The
above question has prompted several works
oriented to the hardware (or a mixture of
hardware and software) implementation of the
production of phrase-structure programing languages
that are subsets of existing programing
languages, or a slight variation of them. Some
examples are: Anderson's [14] implementation
of Algol 60, the design of a Fortran machine
by Bashkow et al. [15], Weber's [16] implementation
of EULER, the design of a cellular APL computer
by Thurber et al. [17], the SYMBOL language and
computer [18], and the APL implementation by
Hassitt et al. [19]. All these studies show
particular advantages in more closely relating
the structure of the programing language and
the structure of the computer hardware. However,
no significant impact was made on the
mainstream of computers, in which languages and
hardware are developed independently. One can
argue that in the above cases the languages used
(at least basically) already existed and were
developed independently of any particular
architecture.


This paper shows a case in which computer
architecture and programing language are not
developed independently, but are treated as
two isomorphic forms of representation of the
same structure -- the abstract mechanisation
of the processes as it is conceived by the user.
The user models the desired process in the form
of an abstract Finite State Machine (FSM), at
a proper level, in terms of the elements of a
language for describing FSMs. The description
of this FSM constitutes a description of the
desired process, but at the same time is also
the specialized architecture of a hypothetical
machine that executes that process. If a
physical substratum (isomorphic with the language
of the FSMs) is available, these hypothetical
machines can be implemented, and the
description of an FSM constitutes a program for this
substratum. Such a substratum can be seen as
an organizable computer. In this case, the
distinction between hardware and software blurs.


This substratum, i.e. a programable
architecture, is outlined in section 2; the
isomorphic programing language is described in
section 3; and in section 4, results and
implications are discussed.


[Figure 1: block diagram; legible labels include
assembler, programable network, page memory.]


2. THE PROGRAMABLE ARCHITECTURE


The essential parts of this architecture
are (Figure 1):

(1) a programable network PN comprising
an array Q of registers for holding a page of
data, a second page array Q', and programable
operational elements which can be connected to
these registers with the related control
circuitry for the execution of operations on the
variables;

(2) a memory for holding pages of data,
the structure of which can be programed in
accordance with the data structures of the
processes;

(3) an assembler which, receiving a page
from the memory and new data from the
environment, assembles the variables of a process and
program words (that describe networks
performing the operations of the present state of the
process) into a page register-array; and

(4) a packer which, receiving a page from
PN into a page register-array, provides the
routing of output data to the environment, and
the packing of the data needed in the future
into the form of a page for the memory.


A page here is a self-sufficient set of 
data related to a job; in the memory, it 
contains the present variables of the process 
and a key word indicating the present state of 
the process; in the assembler and in PN, it 
also includes the new input data and the 
program words of the present state. 
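The circulation of pages through assembler, programable network, and packer can be sketched as follows. This is a sequential illustration under assumptions, not the machine's design: the three units run one after another here instead of overlapping in a pipeline, a page is a plain dictionary, the program words of a state are represented by three Python callables (data transformation, next-state choice, routing), and all names (`assemble`, `process`, `pack`, `circulate`) are hypothetical.

```python
from collections import deque


def assemble(page, new_input, program_words):
    """Assembler: join a stored page with new input and the words of its state."""
    return {"vars": page["vars"], "state": page["state"],
            "input": new_input, "words": program_words[page["state"]]}


def process(assembled):
    """Programable network: transform the variables, pick the next state, route."""
    transform, next_state, route = assembled["words"]
    new_vars = transform(assembled["vars"], assembled["input"])
    return {"vars": new_vars,
            "state": next_state(new_vars, assembled["input"]),
            "routed": route(new_vars)}


def pack(processed, environment):
    """Packer: send routed data to the environment, keep the rest as a page."""
    environment.extend(processed["routed"])
    return {"vars": processed["vars"], "state": processed["state"]}


def circulate(memory, inputs, program_words):
    """Move pages through the loop; the memory behaves as a FIFO store."""
    environment = []
    for new_input in inputs:
        page = memory.popleft()
        assembled = assemble(page, new_input, program_words)
        memory.append(pack(process(assembled), environment))
    return environment


# One hypothetical state, "count": add the input, stay in "count", output the sum.
program_words = {"count": (lambda x, u: x + u,
                           lambda x, u: "count",
                           lambda x: [x])}
memory = deque([{"vars": 0, "state": "count"},
                {"vars": 100, "state": "count"}])

print(circulate(memory, [1, 1, 1, 1], program_words))  # [1, 101, 2, 102]
```

In the memory a page holds only its variables and its state key; the input and the program words are attached only while the page is in the assembler and the network, as the definition of a page above states.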


The basic page transfers can be described
with the use of register transfer notations
[20] by the expressions


t,a,Q, + ty F(Q.0\) + t,4,0' > 


toy E(QyrO\) + t,d,Q, > Q' (1) 


toa, Q + tepBQ,— Q, 


where the functions F are executed by the
programable network; the t are Boolean
time functions produced by the control; and
α, β, γ, δ are Boolean conditional coefficients
with value, meaning, and constraints as shown
in the table below.


Condition (the coefficient column is not legible
in the source):

acquisition of a new page
recirculation of a page
recirculation of a page by-passing PN
processing of a page
acquisition from storage
storage of a page or ...


The system can process in sequence all
the pages through the α paths; it can
continuously process a single page (γ condition);
it can input and output data without
involving PN through the path β; it can buffer a page
for a certain time in the auxiliary page array
Q', through the δ transfers; it can produce a
page in array Q' during processing (combination
of the δ paths and γ) for the execution of
a subtask; it can introduce the new page into
circulation through a δ transfer.


The registers of the assembler and packer
page arrays have one-to-one correspondence with
the registers Q embedded in the programable
network. The packer transfers the data in its
array into the memory in an ordered form; the
same order is used by the assembler to allocate
the data of a page into its own array. In this
way each variable of a process
always goes into the same register of PN,
during the circulation of the page, if not
otherwise prescribed by the program. The memory
moves the pages as a First-Input-First-Output
storage, or with a different rule if indicated
by the program. These features eliminate the
need for explicit addresses. Addresses and
their manipulation account for a large part of
the memory capacity, and for most of the
overhead of conventional computers. When
selective access is required by a given process,
the corresponding addresses are obviously part
of the variables of that process; accordingly,
the packer has the further feature of using
some process variables also for directing
other variables to specific parts of the data
structure organized in the memory.


The programable network does not have per
se a specific operational configuration. It is
a collection of registers, multifunction
elements, and preferred links among them. Program
words enable simultaneous links and functions
in order to implement specialized structures
which perform the data transformations demanded
in each state of a process. A kind of
microprograming extended to its full allows the use of
a large number of all possible combinations of
the loose elements forming the programable
network [21]. In this way, data transformations
involving several variables are executed as a
single large operation. Several different
configurations can be implemented sequentially
during one passage of a page through the
network. The fact that the variables involved are
all present in the network eliminates many
intermediate steps and data movements that
occur in conventional computers. When a process
involves more variables than can be contained
in PN, they are grouped in successive pages;
the auxiliary register array Q', which is also
part of PN, allows the sharing or transfer of
data. The coefficients δ in expressions (1) can
be applied to selected data or to the entire
page.


It is interesting to note that the
architecture of Figure 1 exhibits properties of
many of the different architectures mentioned
in section 1. The system has a pipeline
configuration; while the programable network
processes a page, the assembler assembles the next
page, and the packer packs and routes the
previous page. Parallel processing can be
implemented simply by programing PN as a set of
independent units, or, in virtual form, as a
sequence of pages. The efficiency of a
special-purpose computer can be achieved by structuring
PN according to the specific process. But,
because the specialization of PN can change
at each cycle, the machine is a general-purpose
computer. Because of the three basic features
-- the organization of jobs into independent
pages, the circulation of the pages in a
pipeline configuration, and the loose structure of
the processor and memory -- this architecture
has been named the Circulating Page Loose (CPL)
system.


3. THE PROGRAMING LANGUAGE 


The most interesting peculiarity of the
architecture described in the previous section
is its programability. This programability
permits the execution of the processes in terms
of structures devised by the user each time,
rather than as simulation by means of a given
structure (arithmetic unit connected to a
random access memory) and instructions of a given
set. Thus, here, the programing language refers
to operational structures and data structures,
rather than to commands and declarations.


The primitive elements that have been found
sufficient to efficiently express the variety
of processes we give a computer are the
following:


(i) a finite set of process variables x, a
subset of which is indicated as X;

(ii) a finite set of input data u, a subset
of which is indicated as U;

(iii) a finite set of output devices, and
storages, z; and

(iv) a finite set of labeled process-states
S_j, where a state is defined by:

(v) a function F_j which produces new values
for a subset X as a function of the
values in subsets X and U;

(vi) a function T_j which produces the label
of the next state as a function of the
values of subsets X and U; and

(vii) a prescription R_j for routing some
variables x to some output devices, or
storages, z.


Time is represented as a sequence of
discrete intervals i. A process is modeled as a
finite-state machine represented by the
following expressions, where the symbols refer to
the primitives defined above:


    X(i+1) = F_{s(i)} [X(i), U(i)]
    s(i+1) = T_{s(i)} [X(i+1), U(i)]          (1)
    s(i) = S_1, S_2, ..., S_k


We must note that here states refer to
phases of the model of the processes; they are
neither the total internal states used in
automata theory, nor the conditions of an
implementation used in particular computers. These
states are few and meaningful to the user. In
each state, in general, there will be a
different F and T. Functions F and T are thought of
as operational networks; thus they can also be
described in the form of digital words that
implement those networks in a digital
programable network [22]. In other words, we use the
mapping


    F -> N_F -> W_F                           (2)


where F stands for a description (in any
language) of a data transformation, N_F stands for
an operational network performing that data
transformation, and W_F stands for a digital
word describing (in a language) that network.
This global treatment of the data
transformations gives conciseness to the modeling of
processes, and the use of corresponding global
words W gives conciseness to the actual
programs. A finite-state machine so formulated
is denoted with capital initials Finite State
Machine (FSM). The FSM is the modular block
of the programs.
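The finite-state-machine model above can be made concrete with a short sketch. It is an illustration under assumptions, not the paper's language: each state label maps to a pair of Python callables standing for F and T, the transition function is assumed to be evaluated on the updated variables X(i+1), and the names `run_fsm`, `accumulate`, and `done` are hypothetical.

```python
def run_fsm(states, start, x, inputs):
    """Run a process modeled as an FSM and return the trace of (state, X).

    states maps a state label to (F, T):
      F(X, U) -> new X           (data transformation of that state)
      T(X, U) -> next state label (transition function of that state)
    """
    s = start
    trace = [(s, x)]
    for u in inputs:
        F, T = states[s]
        x = F(x, u)     # X(i+1) from X(i) and U(i)
        s = T(x, u)     # s(i+1), assumed to use the updated X(i+1)
        trace.append((s, x))
    return trace


# Two hypothetical states: accumulate inputs until the sum reaches 10, then hold.
states = {
    "accumulate": (lambda x, u: x + u,
                   lambda x, u: "done" if x >= 10 else "accumulate"),
    "done": (lambda x, u: x,
             lambda x, u: "done"),
}

for step in run_fsm(states, "accumulate", 0, [4, 4, 4, 4]):
    print(step)
# ('accumulate', 0)
# ('accumulate', 4)
# ('accumulate', 8)
# ('done', 12)
# ('done', 12)
```

The states here are phases of the process ("accumulate", "done"), few in number and meaningful to the user, which is the sense of state the text distinguishes from the total internal state of automata theory.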


Complex processes are modeled in the form 
of several concurrent FSMs, each of which may 
be implemented simultaneously by many pages. 
The routing prescriptions R allow the
interaction among FSMs necessary for their concurrent
work. A page transfers through the states of 
an FSM, and can transfer also through different 
FSMs. Thus, a process is modeled as an inter- 
play of processing structures (the FSMs) with 
data structures (the pages). A program can be 
composed of a single FSM and page; or one FSM 
related to many pages; or several FSMs, each 
one related to one page; or many FSMs, each 
with many pages. 


The user develops the FSMs in the form of
state diagrams. Figure 3 shows an example. The
encircled domains represent a state; the data
transformation F is described inside these
domains; the transition functions T are described
with conditions indicated below horizontal lines
and arrows pointing to the new states; the
routing prescriptions R are indicated,
typically, in connection with those arrows. Which
notations are used for expressing the variables,
the F, T, and R is irrelevant at this stage.
State diagrams of this form constitute a
complete description of a process. As such they
also constitute complete programs for a
computer that is isomorphic to the language of the
FSM. When all the elements of the state diagram
(both those represented by graphic means, and
those described by alphanumerical symbols) are
expressed in the codes of that computer, the
actual object program is obtained. The object
program is in the form of a set of quadruplets


    [W_F, W_T, W_R, W_U]_j ,   j = 1, 2, ..., k      (3)


where W_F and W_T are the words that implement
specialized networks performing functions F
and T, W_R is a coded form of the routing
prescriptions R, and W_U is a coded representation
of the input data set U (the union of the input
subsets used by F and T). j is the
state label, and k is the total number of states
involved in that program.


The state diagram is problem oriented and
machine independent in the sense that any
hypothetical machine can be implied in its
construction. When a state diagram is expressed in the
form of specific quadruplets, it becomes
machine dependent. The transformation between the
ideal machine (the state diagram) and the
executable program (the quadruplets), that is, the
mapping (2), is made by the user. Because the
user is expected to be familiar with his
processes, and to know the preferred choices, the
resulting object programs are efficient and
easily understood.


On the other hand, one may think that in
this way, the user is burdened with clerical
tasks of which he is usually relieved by the
compilers. But because of the isomorphism
between the language of the FSM used for
describing the processes and the architecture of the
computer, it turns out that the user works on
his problem and not on the intricacies of a
computer which he is not interested in.
Moreover, the results produced by the computer can
be easily interpreted. As an example of the
level of mechanization in which the user is
involved, a program in the field of numerical
solutions of partial differential equations is
outlined below.


The dynamics of a hypothetical fluid are
modeled in the form of an initial-value problem
with boundary conditions. The analytical
expressions considered are


ot 1 @x 2 dy 
ow = h ON ge ow 
Ot 3 0x 4 Oy (4) 
Ov. Ov Ov 
Bt 7" "SOx * 6 Gy 
where η, w, and v are the variables of the
system, and the h are the given parameters. The chosen
finite-difference approximation is given by


3 
il 
3 
l 
ou 
3 
a 
| 
3 
3 


n+? _ no no n —k ( n _ aon 
Mi 1,] ke (v" ne) 6 ‘ii i,j-1 
with the conventions:

    x = i Δx      i = 1, 2, ..., I
    y = j Δy      j = 1, 2, ..., J
    t = n Δt      n = 1, 2, ..., N

and the k derived from the h.


To obtain the solution, an abstract machine
is conceived that has a page for each point of
the two dimensional space of the system (Figure
2), a page for each boundary point, and a
control page. The symbols η, w, and v, related to
the variables of the process, are considered
as names for three variables x; initial values
a, b, and c of these variables are treated as
input data u; the parameters k also are
treated as input data. Moreover, an additional
three variables x, named D, E and F, are used
for temporary purposes. The pages related to
the points of the fluid perform an FSM 3 as
described in Figure 3, which implements
expressions (5); the pages related to the boundary
points perform an FSM 2 which implements a time
evolution of the boundary values; and the
control page performs an FSM 1 which controls the
work of the entire system. The pages circulate
in the structure shown in Figure 1, with
different scanning as indicated in Figure 2.


Even without entering into the details of
the language of the programable network, Figure
3 should convey the level of abstraction of
these operational structures. For instance,
FSM 1, which constructs and controls the entire
machine, has four states. State 1 is devoted
to creating the page array indicated in Figure
2. In this state, function F consists simply
in incrementing variables A and B by one.
Function T is expressed as a self-explanatory
decision table (the transition from the corner
corresponds to the "else" condition). The
routing is different for the different
transitions and consists of creating pages related
to given FSMs and in clearing variable A. State
2 prescribes a horizontal scanning of the pages.
State 3 prescribes a vertical scanning, and
provides for the test of the number of time
steps. State 4 orders an output record of the
computed quantities, and makes the pages
disappear (transition to a triangle).


[Figure 2: page array with a control page (FSM 1),
boundary pages (FSM 2), and point pages (FSM 3);
horizontal and vertical scans.]


[Figure 3: state diagrams of FSM 1, FSM 2, and FSM 3.]


The computation of the variables η, w, and
v at each point (FSM 3) is obtained by means
of simple networks of a parallel nature estab-
lished by the user in accordance with expres-
sions (5). As an example, in state 1 of FSM 3,
the input set U, consisting of the data a, b,
and c, is transferred in parallel into the set
X, consisting of the variables η, w, and v. In
state 3 of FSM 3, there is a succession of five
networks: the first produces simultaneously the
accumulation of the original values of D, E, F
into η, w, v, and the transfer of the original
values of η, w, v into D, E, F; the second
produces an interchange of values between D, E,
F and D', E', F', which are variables that
remain in the network during the circula-
tion of the pages; the third produces the sub-
traction of D', E', F' from D, E, F; the fourth
produces the multiplication of D, E, F by the
data set of three parameters k_i; and the fifth
the accumulation of the present values of D, E,
F into η, w, v. A routing prescription sends the
present values of η, w, v to an output storage.
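
The succession of five networks in state 3 of FSM 3 can be
mimicked in a few lines of Python. This is a sketch of the
data movements only, not of the network language itself.
The names x (the three computed variables of set X), d
(D, E, F), d_prime (D', E', F') and k (the three parameters)
are chosen here for illustration, and the element-by-element
pairing of the three components is an assumption.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def _add(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def _sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _mul(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] * b[0], a[1] * b[1], a[2] * b[2])


def state3_networks(x: Vec3, d: Vec3, d_prime: Vec3,
                    k: Vec3) -> Tuple[Vec3, Vec3, Vec3]:
    """Apply the five networks in order; return (x, d, d_prime)."""
    # Network 1: accumulate the original d into x and, at the same
    # time, transfer the original x into d. The right-hand side is
    # evaluated before either name is rebound, so both use originals.
    x, d = _add(x, d), x
    # Network 2: interchange d with the values kept in the network.
    d, d_prime = d_prime, d
    # Network 3: subtract d_prime from d.
    d = _sub(d, d_prime)
    # Network 4: multiply d by the parameters k.
    d = _mul(d, k)
    # Network 5: accumulate the present d into x.
    x = _add(x, d)
    return x, d, d_prime


if __name__ == "__main__":
    print(state3_networks((1.0, 2.0, 3.0), (10.0, 20.0, 30.0),
                          (100.0, 200.0, 300.0), (2.0, 2.0, 2.0)))
```

Each network acts on all three components at once, which is
the parallel nature referred to in the text; the simultaneous
assignment in network 1 is what lets the exchange proceed
without an extra temporary variable.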


Obviously, the interest of such con-
structs is not to make the user do what can be
provided by a compiler, but to give the user
the possibility either of providing what has
not been anticipated by the software systems,
or of obtaining specific optimizations. In this
example, the aim was to minimize the memory and
the execution time. The entire computation is
made with 6IJ + 3(I+J) + 2 memory words. The
number of machine cycles is (2N+2)(I+1)(J+1),
with an average of four to five networks per
cycle.
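
As a worked check of these counts, the following Python
sketch takes the memory count as 6IJ + 3(I+J) + 2 words and
the cycle count as (2N+2)(I+1)(J+1). The breakdown by page
type is an inference from the terms of the formula, not a
statement of the paper: IJ point pages of six words each,
I+J boundary pages of three words each, and one control
page of two words. I and J are assumed to be the grid
dimensions and N the number of time steps; the function
names are hypothetical.

```python
from typing import Dict


def page_counts(i: int, j: int) -> Dict[str, int]:
    """Assumed number of pages of each type for an I-by-J grid."""
    return {"point": i * j, "boundary": i + j, "control": 1}


def memory_words(i: int, j: int) -> int:
    """Memory words: 6IJ + 3(I+J) + 2."""
    pages = page_counts(i, j)
    return (6 * pages["point"]
            + 3 * pages["boundary"]
            + 2 * pages["control"])


def machine_cycles(i: int, j: int, n: int) -> int:
    """Machine cycles: (2N+2)(I+1)(J+1)."""
    return (2 * n + 2) * (i + 1) * (j + 1)


if __name__ == "__main__":
    i, j, n = 10, 20, 5
    print(sum(page_counts(i, j).values()))  # total number of pages
    print(memory_words(i, j))
    print(machine_cycles(i, j, n))
```

With this breakdown the total number of pages is
IJ + (I+J) + 1 = (I+1)(J+1), which is the same factor that
appears in the cycle count.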


4. RESULTS AND DISCUSSION 

The results obtained from the use of this
architecture for processing radar signals in
real time have already been reported [23,24,25,26].
Obviously, in these applications, advantage is
derived from the capability of the programable
network to perform complex operations in one
cycle, and to structure the memory in accor-
dance with the stream of data. An application of
significant interest is a program for process-
ing weather-radar signals in real time that
discriminates weather echoes from ground echoes,
during the normal operation of the radar.


The easy interaction between user and com-
puter is also very significant. The fact that
the same FSM is both the model of the process
used by the user and the program actually exe-
cuted by the computer makes it possible to
develop a program in "real time" as suggested
by the results. Along the lines of the mechaniza-
tion of Figure 3, programs have been tried out
that acquire actual initial data, in real
time, from a weather radar, and then produce
different evolutions of the precipitation pat-
tern in terms of the values of parameters set
by the user at each time, or of modifications of
the programs.


In research work, data processing is typi- 
cally achieved today by means of systems com- 
prising several special-purpose units and a 
general-purpose computer. The former efficient-
ly execute the particular data transformations
demanded in the process, and the latter provides
for the computation and the control of the en- 
tire system. In these cases, a single computer 
with a CPL architecture could advantageously 
perform all these activities. The programable 
network is capable of executing both the par- 
ticular data transformations and the computa- 
tions; the programs in the FSM form are par- 
ticularly suitable for controlling complex 
activities; and the organization of data in the 
form of pages makes efficient use of the memory 
capacity. 


Another field for which this architecture 
is particularly efficient is that in which dif- 
ferential analyzers were advantageously used 
[27]. The PN can be programed in the form of 
integrators, and each page takes on the role 
of a term in a system of equations. Transfor- 
mations such as the Fast Fourier Transform, 
similarly, can be executed efficiently by pro-
perly organizing the pages and configuring the
PN for complex butterflies [28]. This architec-
ture has also been suggested for the computers
in an integrated telecommunication network [29]. 


But the fact that this architecture directly
accepts programs expressed in the FSM form, and
that these programs correspond to the image of the
process as developed by the user, triggers a
more general interest in this approach [30].
The next subject of study that seems deserving 
of attention is the feasibility of a programing 
language that shares the flexibility and con- 
ciseness of the FSM and the adaptability to 
different forms of expression offered by the 
well-established use of compilers. 


As far as the hardware-software trade-off
is concerned, this architecture constitutes an
interesting new approach. The programability
of the hardware configuration allows the effi-
ciency peculiar to the hardware implementations
together with the flexibility characteristic of
the software implementations. Moreover, the des-
cription of these configurations is also inter-
esting as a programing language per se. In using
the first machine constructed with this architec-
ture [31], we consistently find that programs
in the form of FSM are much simpler than the
equivalent programs in conventional machine
language, and they have a complexity comparable
to that of programs expressed in high level
language. For complex processes, the programs in
FSM form seem to be simpler than the equivalent
ones in high level language. This finding has
an interesting similarity with von Neumann's
contention that for complex automata the de-
scription of an automaton is simpler than the
description of the process performed by the
automaton [32].